Trending Articles
- AARS1 and AARS2 sense L-lactate to regulate cGAS as global lysine lactyltransferases. Li H, et al. Nature. 2024. PMID: 39322678
- Light-induced remodeling of phytochrome B enables signal transduction by phytochrome-interacting factor. Wang Z, et al. Cell. 2024. PMID: 39317197
- Transplantation of chemically induced pluripotent stem-cell-derived islets under abdominal anterior rectus sheath in a type 1 diabetes patient. Wang S, et al. Cell. 2024. PMID: 39326417
- Global, regional, and national burden of stroke and its risk factors, 1990-2021: a systematic analysis for the Global Burden of Disease Study 2021. GBD 2021 Stroke Risk Factor Collaborators. Lancet Neurol. 2024. PMID: 39304265
- Experience and Learning from the COVID-19 Pandemic in Portugal: Perceptions of Community Pharmacy Professionals. Advinha AM, et al. Port J Public Health. 2023. PMID: 38021255 Free PMC article.
Latest Literature
- Am J Med (5)
- Arch Phys Med Rehabil (1)
- Gastroenterology (2)
- J Am Acad Dermatol (1)
- J Biol Chem (1)
- Kidney Int (1)
- Nat Commun (143)
NCBI Literature Resources
MeSH PMC Bookshelf Disclaimer
The PubMed wordmark and PubMed logo are registered trademarks of the U.S. Department of Health and Human Services (HHS). Unauthorized use of these marks is strictly prohibited.
Scopus: Comprehensive, multidisciplinary, trusted abstract and citation database
Quickly find relevant and authoritative research, identify experts and gain access to reliable data, metrics and analytical tools. Be confident in advancing research, educational goals, and research direction and priorities — all from one database.
Enhance research and scholarship with comprehensive data and analytics
Increase research efficiency.
Having access to comprehensive content and high-quality data is effective only if you can easily find the information you need. The state-of-the-art search tools and filters in Scopus enable you to quickly:
Discover relevant sources
Identify trends in research or emerging topics
Uncover potential research collaborators
Use our Quick Reference Guide to learn about our search features and filters.
Download the Scopus Quick Reference Guide
Identify emerging trends
Scopus has comprehensive scholarly literature, data and analytical tools to keep you up-to-date and ahead of the competition.
97.3M+ records
28,300+ active serial titles
368,000+ books
Watch this video to get a quick overview of how Scopus helps organizations of any size progress basic and applied research, support educational goals, and inform research strategies.
Download our fact sheet with Scopus content figures and the latest product updates.
Learn more about Scopus content
Accelerate your research
In the ever-changing landscape of academic research, staying at the forefront requires modern tools. Scopus AI is an AI-powered tool that helps you navigate the vast amount of information available in Scopus, allowing you to gain a deeper understanding of your research topic, generate new insights, and enhance your overall research experience.
Scopus AI accelerates the journey from inquiry to discovery, enabling you to push the boundaries of knowledge and drive innovation in your field.
Learn more about Scopus AI
Inform strategic research decisions
Scopus empowers organizations with unparalleled access to critical global research, which can be integrated with existing platforms to increase analysis and insights.
Its advanced suite of analytical tools helps users visualize, compare and export data to evaluate research output and trends, assisting in measuring research performance at the individual or institutional level, and helping inform strategic research decisions.
Learn more about Scopus data
Enhance research visibility
Scopus Author Profiles offer new insights into the reach and influence of research, helping to build a reliable body of work to support career goals. Once a profile is validated, Scopus takes over, automatically populating it and continuously building on an author's credentials.
Scopus is the only database to blend automated and manually curated data to generate current author profiles. This process allows us to deliver over 17m profiles that support accurate author searches in the same way you can search for articles: efficiently and easily.
Learn more about Author Profiles
Show journal, article & author influence
Scopus outperforms other abstract and citation databases by providing a broader range of research metrics covering nearly twice the number of peer-reviewed publications.
Using Scopus metrics, you can demonstrate the influence of your institution's scholarly output. Discover the details behind our metrics, giving you confidence in knowing how the numbers are derived.
Learn more about Scopus metrics
What's new?
As we continue to refine and update Scopus AI, we’re excited to announce the release of Copilot, a new feature for Scopus AI to handle specific and complex queries.
Learn more about Copilot
It's here! CiteScore 2023 is now available, providing transparent insights into journal citation impact.
Learn more about CiteScore 2023
Why choose Scopus?
Industry-leading collection of scholarly abstracts and citations, comprehensive coverage, greater insights, independent review and selection, intuitive search, better tools, better results, more metrics, serve your organization's research and education needs, Scopus for Enterprise.
Scopus is available as a subscription only for organizations. Individuals, please contact your library or information resource center.
View Scopus for free
See what Scopus can do for you by visiting Scopus Preview for free.
"When it comes to measuring success, you can’t compare other products to Scopus — no other output metrics offer the same kind of depth and coverage ... faculty, department chairs, college deans, they are always amazed when they discover what’s possible." Read the full customer story opens in new tab/window
Hector R. Perez-Gilbe
Research Librarian for the Health Sciences, University of California, Irvine (USA)
Frequently asked questions
How do I get a complete list of titles indexed in Scopus?
Use our free Scopus Preview to get a complete list of titles indexed in Scopus and access to Scopus metrics.
How do I request changes to an author profile?
To request changes to an author profile, follow the link to learn about our Author Profile Wizard.
How do I request a title correction on Scopus?
Learn more about making title corrections and changes.
How do I submit a journal, book or conference for indexing?
Visit the Scopus Content Policy & Selection page for more information about submitting a journal, book or conference for indexing.
Where can I find information about Scopus APIs?
To learn about Scopus APIs, please visit our Developer Portal.
Learn how Scopus can help your organization achieve its goals.
Related links
Reference management. Clean and simple.
Academic Databases
ERIC research database: complete tutorial
The ERIC database is the premier education literature database for scholarly research. This guide covers search types and strategies, filters, and full text options.
How to efficiently search online databases for academic research
Academic research isn't difficult if you know where and how to search for scholarly articles and research papers. Here's how to do it.
How to use Google Scholar: the ultimate guide
Google Scholar is the number one academic search engine. Our detailed guide covers best practices for basic and advanced search strategies in Google Scholar.
How to use PubMed: the ultimate guide
PubMed is the most popular search engine for biomedical sciences. Learn how to use PubMed, basic and advanced search strategies, and about its limitations and alternatives in this ultimate guide.
Is Google Scholar a database or search engine? [Update 2024]
Google Scholar is the number one free resource to discover scientific literature, but is it an academic database or a search engine?
The best academic research databases [Update 2024]
Your research is stuck and you need to find new sources? Take a look at our compilation of academic research databases: Scopus, Web of Science, PubMed, ERIC, JSTOR, DOAJ, Science Direct, and IEEE Xplore.
The best academic search engines [Update 2024]
Your research is stuck, and you need to find new sources. Take a look at our compilation of free academic search engines: ✓ Google Scholar ✓ BASE ✓ CORE ✓ Science.gov
The best research databases for computer science [Update 2024]
The top 4 research databases specifically dedicated to computer science: ✓ ACM Digital Library ✓ IEEE Xplore ✓ dblp ✓ Springer LNCS
The best research databases for healthcare and medicine [Update 2024]
We have compiled the top list of research databases for healthcare, medicine, and biomedical research: PubMed, EMBASE, PMC, and Cochrane Library.
National Center for Biotechnology Information
Welcome to NCBI
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
- About the NCBI |
- Organization |
- NCBI News & Blog
Deposit data or manuscripts into NCBI databases
Transfer NCBI data to your computer
Find help documents, attend a class or watch a tutorial
Use NCBI APIs and code libraries to build applications
Identify an NCBI tool for your data analysis task
Explore NCBI research and collaborative projects
- Resource List (A-Z)
- All Resources
- Chemicals & Bioassays
- Data & Software
- DNA & RNA
- Domains & Structures
- Genes & Expression
- Genetics & Medicine
- Genomes & Maps
- Sequence Analysis
- Training & Tutorials
Popular Resources
- PubMed Central
- Online Book Collections
- Online Books by Topic
- Biodiversity Heritage Library
- Library Catalog (SIRIS)
- Image Gallery
- Art & Artist Files
- Caldwell Lighting
- Trade Literature
- All Digital Collections
- Current Exhibitions
- Online Exhibitions
- Past Exhibitions
- Index of Library & Archival Exhibitions on the Web
- Research Tools and OneSearch
- E-journals, E-books, and Databases
- Smithsonian Research Online (SRO)
- Borrowing and Access Privileges
- Smithsonian Libraries and Archives on PRISM (SI staff)
- E-news Sign Up
- Internships and Fellowships
- Work with Us
- About the Libraries
- Library Locations
- Departments
- History of the Libraries
- Advisory Board
- Annual Reports
- Adopt-a-Book
- Ways to Give
- Gifts-in-Kind
Databases for Science Research
This list reflects just some of the science databases available to researchers from the Smithsonian Libraries and Archives. For a complete list of subscription and vetted databases go to E-journals, E-books, and Databases . For more subject-specific resources see our Science Research Guides .
Databases that require SI network for access are indicated by "SI staff." For information about remote access see Off-Site Access to Electronic Resources .
Broad Science Research Databases
- AGRICOLA : The National Agricultural Library's comprehensive database covering agriculture and allied disciplines, including: chemistry, engineering, entomology, forestry, social science (general), and water resources.
- Anthropology Plus (SI staff): Index of journal articles, and additional resources from core and lesser-known journals from the early 19th century to today.
- Encyclopedia of Life (EOL) : Online encyclopedia of all living species, currently numbering over 1.9 million.
- GeoRef (SI staff): Over 3.4 million references in the geosciences, including journal articles, books, maps, conference papers, reports and theses.
- Google Scholar : Accessing Google Scholar from the Smithsonian computer network provides access to library-subscribed full text. For more information, see Off-Site Access to Electronic Resources .
- Journal Citation Reports (SI staff): Provides impact factors and rankings of many journals in the social and life sciences based on citation analysis.
- PubMed : Includes more than 22 million citations for biomedical literature from MEDLINE, life science journals, and online books.
- U.S. Geological Survey Library : Among the largest geoscience library collections in the world.
- Web of Science: Core Collection (SI staff): Covers over 12,000 of the highest impact journals worldwide with coverage from 1900 to present. See the Web of Science training portal for additional help resources.
- Worldcat (web) or WorldCat (OCLC FirstSearch) (SI staff): Combined search of thousands of library catalogs from around the world, including books, music, videos, and digital content records.
- Zoological Record (SI staff): Considered the world's leading taxonomic reference for zoological names, indexing 90% of the world literature in zoology.
Focused Science Databases
For research guides on a variety of natural history topics, including additional databases and resources, see our Science Research Guides .
- Algaebase : Database of taxonomic, nomenclatural, and distributional information on terrestrial, marine, and freshwater algae organisms.
- AnimalBase: Early Zoological Literature Online : Hosted by the Zoological Institute of the University of Gottingen, this database provides open access to zoological works from 1550-1770.
- AnthroSource (SI staff): Full-text anthropological resources from the breadth and depth of the discipline.
- AquaDocs : A thematic repository covering the natural marine, coastal, estuarine /brackish and fresh water environments.
- Birds of North America Online : Comprehensive resource from the Cornell Lab of Ornithology and the American Ornithologists' Union.
- Catalogue of Life (Species 2000 - ITIS) : Project that catalogued over one million species as of 2001.
- FishBase : Database covering the breadth of all known species, considered a powerful tool for ecology.
- GreenFILE (SI staff): Covers connections between the environment and a variety of disciplines and includes topics such as global climate change, green building, and more.
- Index of Botanical Publications (Harvard University Herbaria): Created to assist in the verification of publication names in the Specimen Database and the Gray Index.
- Index of Botanists (Harvard University Herbaria): Comprehensive database of authors and collectors in botany, mycology, including systematic publications.
- IOPI Database of Plant Databases : Hosted by Charles Sturt University, this meta-database allows the user search granularity across 100 metadata fields.
- International Plant Names Index : Nomenclatural database for the scientific names of vascular plants, linking directly to the Biodiversity Heritage Library .
- ITIS: Integrated Taxonomic Information System : Created through international partnerships (including the Smithsonian), this database hosts authoritative taxonomic information on plants, animals, fungi, and microbes.
- JSTOR Global Plants (SI staff): Includes plant type specimens, taxonomic structures, scientific literature, and related materials.
- KBD: Kew Bibliographic Databases : Selection of 24 botanical databases containing information on correspondence, herbaria, seed lists, etc.
- Latindex : Regional Cooperative Online Information System for Scholarly Journals from Latin America, the Caribbean, Spain and Portugal. For more details, visit http://en.wikipedia.org/wiki/Latindex .
- National Museum of Natural History Research and Collections Information System (EMu) : Over ten million specimen records covering six departments and four divisions of the National Museum of Natural History.
- SORA: Searchable Ornithological Research Archive : Developed by the University Libraries at the University of New Mexico, SORA is the world’s largest open access ornithological publications database.
Digitized Science Collections
- AnimalBase: Early Zoological Literature Online : Hosted by the Zoological Institute of the University of Gottingen, this database provides open access to zoological works from 1550-1770.
- Biodiversity Heritage Library : A digital library containing primarily historical texts in the natural sciences.
- Field Book Project (Smithsonian Archives) : With the purpose of illuminating unpublished works integral to scientific research, the FBP database contains 4,000 digitized field books and 9,500 catalogued field books, in total.
- Joseph Henry Papers Project (Smithsonian Archives) : The scientific output of the first Secretary of the Smithsonian. Contains over 170,000 documents in fifteen scientific disciplines.
Database Search
What is Database Search?
Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more.
The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places. Use Database Search to identify and connect to the best databases for your topic.
In addition to digital content, you will find specialized search engines used in specific scholarly domains.
Related Services & Tools
The Directory of Open Access Journals
Find open access journals & articles.
DOAJ in numbers
80 languages
135 countries represented
13,753 journals without APCs
20,943 journals
10,495,331 article records
Quick search
About the directory
DOAJ is a unique and extensive index of diverse open access journals from around the world, driven by a growing community, and is committed to ensuring quality content is freely available online for everyone.
DOAJ is committed to keeping its services free of charge, including being indexed, and its data freely available.
→ About DOAJ
→ How to apply
DOAJ is twenty years old in 2023.
Fund our 20th anniversary campaign
DOAJ is independent. All support is via donations.
82% from academic organisations
18% from contributors
Support DOAJ
Publishers don't need to donate to be part of DOAJ.
News Service
- Meet the DOAJ team: Head of Editorial and Deputy Head of Editorial (Quality)
- Vacancy: Operations Manager
- Press release: PubScholar joins the movement to support the Directory of Open Access Journals
- New major version of the API to be released
→ All blog posts
We would not be able to work without our volunteers, such as these top-performing editors and associate editors.
→ Meet our volunteers
Librarianship, Scholarly Publishing, Data Management
Brisbane, Australia (Chinese, English)
Adana, Türkiye (Turkish, English)
Humanities, Social Sciences
Natalia Pamuła
Toruń, Poland (Polish, English)
Medical Sciences, Nutrition
Pablo Hernandez
Caracas, Venezuela (Spanish, English)
Research Evaluation
Paola Galimberti
Milan, Italy (Italian, German, English)
Social Sciences, Humanities
Dawam M. Rohmatulloh
Ponorogo, Indonesia (Bahasa Indonesia, English, Dutch)
Systematic Entomology
Kadri Kıran
Edirne, Türkiye (English, Turkish, German)
Library and Information Science
Nataliia Kaliuzhna
Kyiv, Ukraine (Ukrainian, Russian, English, Polish)
Recently-added journals
DOAJ’s team of managing editors, editors, and volunteers work with publishers to index new journals. As soon as they’re accepted, these journals are displayed on our website freely accessible to everyone.
→ See Atom feed
→ A log of journals added (and withdrawn)
→ DOWNLOAD all journals as CSV
- Journal of Biomedical & Clinical Research
- Nota al Margen
- RGUHS National Journal of Public Health
- Magistra Andalusia
- Revista Trágica
- Shiyou huagong gaodeng xuexiao xuebao
- JMIR XR and Spatial Computing
- Revista de Investigación Educativa Intervención Pedagógica y Docencia
- Revista Científica SENAI-SP
- Revista Educação e Emancipação
- Jurnal Gizi dan Pangan Soedirman
- Germanica Wratislaviensia
- Latin American Law Review
- Geografia (Londrina)
Free Databases
EBSCO provides free research databases covering a variety of subjects for students, researchers and librarians.
Exploring Race in Society
This free research database offers essential content covering important issues related to race in society today. Essays, articles, reports and other reliable sources provide an in-depth look at the history of race and provide critical context for learning more about topics associated with race, ethnicity, diversity and inclusiveness.
EBSCO Open Dissertations
EBSCO Open Dissertations is a collaboration between EBSCO and BiblioLabs to increase traffic and discoverability of ETD research. You can join the movement and add your theses and dissertations to the database, making them freely available to researchers everywhere.
GreenFILE
GreenFILE is a free research database covering all aspects of human impact on the environment. Its collection of scholarly, government and general-interest titles includes content on global warming, green building, pollution, sustainable agriculture, renewable energy, recycling, and more.
Library, Information Science and Technology Abstracts
Library, Information Science & Technology Abstracts (LISTA) is a free research database for library and information science studies. LISTA provides indexing and abstracting for hundreds of key journals, books and research reports. It is EBSCO's intention to provide access to this resource on a continual basis.
Teacher Reference Center
A complimentary research database for teachers, Teacher Reference Center (TRC) provides indexing and abstracts for more than 230 peer-reviewed journals.
European Views of the Americas: 1493 to 1750
European Views of the Americas: 1493 to 1750 is a free archive of indexed publications related to the Americas and written in Europe before 1750. It includes thousands of valuable primary source records covering the history of European exploration as well as portrayals of Native American peoples.
- Data Descriptor
- Open access
- Published: 01 June 2023
SciSciNet: A large-scale open data lake for the science of science research
- Zihang Lin (ORCID: orcid.org/0000-0003-4262-6354) 1,2,3,4
- Yian Yin (ORCID: orcid.org/0000-0003-3018-4544) 1,2,3,5
- Lu Liu 1,2,3
- Dashun Wang (ORCID: orcid.org/0000-0002-7054-2206) 1,2,3,5
Scientific Data volume 10, Article number: 315 (2023)
The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of efforts in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.
Background & Summary
Modern databases capturing the inner workings of science have been growing exponentially over the past decades, offering new opportunities to study scientific production and use at larger scales and finer resolution than previously possible. Fuelled in part by the increasing availability of large-scale datasets, the science of science community turns scientific methods on science itself [1–6], helping us understand in a quantitative fashion a range of important questions that are central to scientific progress, and of great interest to scientists themselves, from the evolution of individual scientific careers [7–18] to collaborations [19–25] and science institutions [26–28] to the evolution of science [2,3,5,29–34] to the nature of scientific progress and impact [35–55].
Scholarly big data have flourished over the past decade, with several large-scale initiatives providing researchers free access to data. For example, CiteSeerX [56], one of the earliest digital library search engines, offers a large-scale scientific library focusing on the literature in computer and information science. Building on a series of advanced data mining techniques, AMiner [57] indexes and integrates a wide range of data about academic social networks [58]. Crossref (https://www.crossref.org/) [59], as well as other initiatives in the open metadata community, has collected metadata such as the Digital Object Identifier (DOI) in each publication record and linked them to a broad body of event data covering scholarly discussions. OpenAlex (https://openalex.org/) [60], based on Microsoft Academic Graph (MAG) [61–63], aims to build a large-scale open catalog for the global research system, incorporating scholarly entities and their connections across multiple datasets. In addition to data on scientific publications and citations capturing within-science dynamics, researchers have also tracked interactions between science and other socioeconomic spheres by tracing, for example, how science is referenced in patented inventions [64–66], regarding both front-page and in-text citations from patents to publications [67,68]. Table 1 summarizes several exemplary datasets commonly used in the science of science literature, with information on their coverage and accessibility.
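Linking records across sources like those above often relies on a shared identifier such as the DOI. The following is a minimal illustrative sketch, not the method of any project named here: it normalizes DOI strings (DOIs are case-insensitive) and matches two hypothetical record lists. The function names, the record fields `id` and `doi`, and the list of stripped prefixes are all assumptions made for the example.

```python
# Hypothetical helpers for DOI-based record linkage (illustrative only).

# Common prefixes seen in front of a bare DOI; this list is an assumption.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the input is empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None


def link_by_doi(records_a, records_b):
    """Pair record ids from two sources whose normalized DOIs match."""
    # Index the second source by normalized DOI for constant-time lookup.
    index = {}
    for rec in records_b:
        key = normalize_doi(rec.get("doi"))
        if key is not None:
            index[key] = rec
    pairs = []
    for rec in records_a:
        key = normalize_doi(rec.get("doi"))
        if key is not None and key in index:
            pairs.append((rec["id"], index[key]["id"]))
    return pairs
```

Records without a DOI are simply left unmatched in this sketch; a real pipeline would likely need fallback matching on titles or other metadata.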
The rapid growth of the science of science community [69–71], combined with its interdisciplinary nature, raises several key challenges confronting researchers in the field. First, it becomes increasingly difficult to keep track of available datasets and their potential linkages across disparate sources, raising the question of whether there are research questions that are underexplored simply due to a lack of awareness of the data. Second, as data and their linkages become more complex, there are substantial data pre-processing steps involved prior to analyses. Many of these steps are often too detailed to document in publications, with researchers making their own analytical choices when processing the data. Third, as tools and techniques used in the science of science grow in sophistication, measurements on these datasets can be computationally involved, requiring substantial investment of time and resources to compute these measures.
All these challenges highlight the need for a common data resource designed for research purposes, which could benefit the community in several important ways. First, it provides a large-scale empirical basis for research, helping to strengthen the level of evidence supporting new findings as well as increase the replicability and robustness of these findings. Second, it helps to reduce duplication of efforts across the community in data preprocessing and common measurements. Third, by compiling various datasets, linkages, and measurements, the data resource significantly lowers the barrier to entry, hence has the potential to broaden the diversity and representation of new ideas in the field.
To support these needs in the community, we present SciSciNet, a large-scale open data lake for the science of science research. The data lake not only incorporates databases that capture scientific publications, researchers, and institutions, but also tracks their linkages to related entities, ranging from upstream funding sources like NIH and NSF to downstream public uses, including references of scientific publications in patents, clinical trials, and media and social media mentions (see Fig. 1 and Table 2 for more details of entities and their relationships). Building on this collection of linked databases, we further calculate a series of commonly used measurements in the science of science, providing benchmark measures to facilitate further investigations while illustrating how researchers can further contribute collectively to the data lake. Finally, we validate the data lake using multiple approaches, including internal data validation, cross-database verification, as well as reproducing canonical results in the literature.
The entity relationship diagram of SciSciNet. SciSciNet includes “SciSciNet_Papers” as the main data table, with linkages to other tables capturing data from a range of sources. For clarity, here we show a subset of the tables (see Data Records section for a more comprehensive view of the tables). PK represents primary key, and FK represents foreign key.
The data lake, SciSciNet, is freely available at Figshare 72 . At the core of the data lake is the Microsoft Academic Graph (MAG) dataset 61 , 62 , 63 . MAG is one of the largest and most comprehensive bibliometric datasets in the world, and a popular dataset for science of science research. However, MAG was sunset by Microsoft at the end of 2021. Since then, there have been several important efforts in the community to ensure the continuity of data and services. For example, mirror datasets 73 for MAG are available online, and the OpenAlex ( https://openalex.org ) initiative builds on the MAG data, making it open to all and providing continuous updates 60 . While these efforts have minimized potential disruptions, the sunsetting of MAG has also accelerated the need to construct open data resources designed for research purposes. Indeed, large-scale systematic datasets for the science of science mostly come in the form of raw data, which require further pre-processing and filtering to extract fine-grained, high-quality research data. Cleaning the data usually takes substantial effort and expertise, and many of these steps are too detailed to document in publications, with researchers making their own analytical choices. This suggests that there is value in constructing an open data lake that extends the usefulness of MAG, with the substantial data pre-processing steps documented. Moreover, the data lake links together several disparate sources and pre-computed measures commonly used in the literature, serving as an open data resource for researchers interested in the quantitative studies of science and innovation.
Importantly, the curated data lake is not meant to be exhaustive; rather it represents an initial step toward a common data resource to which researchers across the community can collectively contribute. Indeed, as more data and measurements in the science of science become available, researchers can help to contribute to the continuous improvement of this data lake by adding new data, measurements, and linkages, thereby further increasing the utility of the data lake. For example, if a new paper reports a new measurement, the authors could publish a data file linking the new measurement with SciSciNet IDs, which would make it much easier for future researchers to build on their work.
Data selection and curation from MAG
The Microsoft Academic Graph (MAG) dataset 61 , 62 , 63 covers a wide range of publication records, authors, institutions, and citation records among publications. MAG has a rich set of prominent features, including the application of advanced machine learning algorithms to classify fields of study in large-scale publication records, identify paper families, and disambiguate authors and affiliations. Here we use the edition released by MAG on December 6th, 2021, covering 270,694,050 publication records in total.
The extensive nature of the MAG data highlights a common challenge. Indeed, using the raw data for research often requires substantial pre-processing and data-cleaning steps to arrive at a research-ready database. For example, one may need to perform a series of data selection and curation operations, including the selection of scientific publications with reliable sources, aggregation of family papers, and redistribution of citation and reference counts. After going through these steps, one may generate a curated publication data table, which serves as the primary scientific publication data table in SciSciNet (Table 3 , “SciSciNet_Papers”). Each of these steps requires specific analytical choices, yet because the steps are so detailed, the choices made have remained difficult to document in research publications.
Here we document in detail the various procedures we took in constructing the data lake. From the original publication data in MAG, we use MAG Paper ID as the primary key, and consider a subset of main attributes, including DOI (Digital Object Identifier), document type and publication year. As we are mainly interested in scientific publications within MAG, we first remove paper records whose document type is marked as patent. We also remove those with neither document type nor DOI information. Each scientific publication in the database may be represented by different entities (e.g., preprint and conference), indicated as a paper “family” in MAG. To avoid duplication, we aggregate all papers in the same family into one primary paper. We also do not include retracted papers in the primary paper table in SciSciNet. Instead, we include records of retracted papers and affiliated papers in paper families in another data table “SciSciNet_PaperDetails” (Table 8 ) linked to the primary paper table, recording information of DOIs, titles, original venue names, and original counts for citations and references in MAG. Following these steps, the primary data table “SciSciNet_Papers” contains 134,129,188 publication records with unique primary paper ids, including 90,764,813 journal papers, 4,629,342 books, 3,932,366 book chapters, 5,123,597 conference papers, 145,594 datasets, 3,083,949 repositories, 5,998,509 thesis papers, and 20,451,018 other papers with DOI information.
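The selection rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build SciSciNet: the record fields (`paper_id`, `doc_type`, `doi`, `family_id`, `retracted`) and the convention that `family_id` points to the primary paper of a family are hypothetical.

```python
def curate_papers(records):
    """Apply the selection rules to a list of paper dicts.

    Returns (primary, details): IDs kept in the primary paper table, and IDs
    of retracted or affiliated family papers kept only in a details table.
    All field names are hypothetical.
    """
    primary, details = set(), set()
    for rec in records:
        doc_type, doi = rec.get("doc_type"), rec.get("doi")
        # Rule 1: remove records whose document type is patent.
        if doc_type == "Patent":
            continue
        # Rule 2: remove records with neither document type nor DOI.
        if not doc_type and not doi:
            continue
        # Rule 3: retracted papers are recorded in the details table only.
        if rec.get("retracted"):
            details.add(rec["paper_id"])
            continue
        # Rule 4: affiliated family members collapse into their primary paper.
        family_id = rec.get("family_id") or rec["paper_id"]
        if family_id != rec["paper_id"]:
            details.add(rec["paper_id"])
        else:
            primary.add(rec["paper_id"])
    return primary, details
```

Keeping the set-aside records in a separate collection mirrors the split between “SciSciNet_Papers” and “SciSciNet_PaperDetails” described above.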
For consistency, we recalculate the citation and reference counts within the subset of 134 M primary papers, such that each citation or reference record is also included in this subset and can be found in “SciSciNet_PaperReferences” (Table 5 ). For papers in the same family, we aggregate their citations and references into the primary paper and drop duplicated citation pairs. Building on the updated citations, we recalculate the number of references and citations for each primary paper.
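As a rough sketch of this recount, assuming citation records are available as `(citing_id, cited_id)` pairs and that a hypothetical dictionary `to_primary` maps every paper ID in the curated subset to its primary Paper ID:

```python
from collections import Counter


def recount_citations(citation_pairs, to_primary):
    """Remap citation pairs to primary IDs, drop duplicates, and recount.

    citation_pairs: iterable of (citing_id, cited_id).
    to_primary: dict mapping each curated paper ID to its primary ID
                (hypothetical input; papers outside the subset are absent).
    """
    unique_pairs = set()
    for citing, cited in citation_pairs:
        # Keep a pair only if both ends belong to the curated subset.
        if citing not in to_primary or cited not in to_primary:
            continue
        # Family members are folded into their primary paper; the set
        # removes citation pairs that become duplicates after folding.
        unique_pairs.add((to_primary[citing], to_primary[cited]))
    reference_count = Counter(citing for citing, _ in unique_pairs)
    citation_count = Counter(cited for _, cited in unique_pairs)
    return unique_pairs, reference_count, citation_count
```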
MAG also contains information on authors, institutions, and fields. While author disambiguation 58 , 74 , 75 , 76 , 77 , 78 , 79 remains a major challenge, we adopt the author disambiguation method from MAG and create an author table, which offers a baseline for future studies of individual careers. We also supplement the author table with empirical name-gender associations to support gender research, drawing from work by Van Buskirk et al. 80 ; this allows us to build “SciSciNet_Authors_Gender” (Table 9 ) with 134,197,162 author records including their full names.
For fields, we use the fields of study records from MAG and focus on the records related to the selected primary papers (19 Level-0 fields and 292 Level-1 fields, Table 6 ). We incorporate this information into two tables, the “SciSciNet_PaperAuthorAffiliations” (Table 4 ) and “SciSciNet_PaperFields” (Table 7 ), with 413,869,501 and 277,494,994 records, respectively.
We further use the information of “PaperExtendedAttributes” table from MAG to construct high-quality linkages between MAG Paper ID and PubMed Identifier (PMID). We drop duplicate links by only keeping the MAG primary paper record (if one PMID was linked to multiple MAG Paper IDs) or the latest updated PubMed record (if one MAG Paper ID was linked to multiple PMIDs), obtaining 31,230,206 primary MAG Paper ID-PMID linkages (95.6% of the original records) to further support linkage with external sources.
Together, the resulting SciSciNet includes 134,129,188 publications (Table 3 ), 134,197,162 authors (Table 9 ), 26,998 institutions (Table 10 ), 49,066 journals (Table 21 ), 4,551 conference series (Table 22 ), 19 top-level fields of study, 292 subfields (Table 6 ), and the internal links between them, including 1,588,739,703 paper-references records (Table 5 ), 413,869,501 paper-author-affiliations records (Table 4 ), and 277,494,994 paper-fields records (Table 7 ).
Linking publication data with external sources
While the main paper table captures citation relationships among scientific publications, there has been growing interest in studying how science interacts with other socioeconomic institutions 35 , 36 , 41 , 55 , 81 , 82 . Here, we further trace references of scientific publications in data sources that go beyond publication datasets, tracking the linkage between papers to their upstream funding supports and downstream uses in public domains. Specifically, here we link papers to the grants they acknowledge in NSF and NIH, as well as public uses of science by tracking references of scientific publications in patents, clinical trials, and news and social media.
NIH funding
The National Institutes of Health (NIH) is the largest public funder for biomedical research in the world. The recent decade has witnessed increasing interest in understanding the role of NIH funding for the advancement of biomedicine 81 , 82 and its impact on individual career development 83 , 84 . NIH ExPORTER provides bulk NIH RePORTER ( https://report.nih.gov/ ) data on research projects funded by the NIH and other major HHS operating divisions. The database also provides link tables (updated on May 16, 2021) that connect funded projects with resulting publications over the past four decades.
To construct the funded project-paper linkages between SciSciNet Paper ID and NIH Project Number, we use the PMID of MAG papers (from our previously curated “PaperExtendedAttributes” table based on MAG) as the intermediate key, matching more than 98.9% of the original NIH link table records to primary Paper ID in SciSciNet. After dropping duplicate records, we end up with a collection of 6,013,187 records (Table 11 ), linking 2,636,061 scientific papers (identified by primary MAG Paper IDs) to 379,014 NIH projects (identified by core NIH-funded project numbers).
NSF funding
Beyond biomedical research, the National Science Foundation (NSF) funds approximately 25% of all federally supported basic research conducted by the United States’ colleges and universities across virtually all fields of science and engineering. NSF provides downloadable information on research projects it has funded, including awardee, total award amount, investigator, and so forth, but no information on funded research publications. While Federal RePORTER offers downloadable files on NSF awards with links to supported publications (662,072 NSF award-publication records by 2019), it only covers a limited time period and was retired by March 2022. To obtain more comprehensive coverage of records linking NSF awards to supported papers, we crawl the webpages of all NSF awards to retrieve information on their resulting publications. In particular, we first create a comprehensive list of all NSF award numbers from https://www.nsf.gov/awardsearch/download.jsp . We then iterate over this list to download the entire webpage document of each NSF award (from the URL https://www.nsf.gov/awardsearch/showAward?AWD_ID=[Award number]), and use the “Publications as a result of this research” column to identify scientific publications related to this award. We then extract paper titles and other relevant information, using the Python library ElementTree to navigate and parse the webpage document structurally. We end up collecting 489,446 NSF awards since 1959 (Table 20 ), including linkages between 131,545 awards and 1,350,915 scientific publications.
To process information crawled from NSF.gov, which is presented as raw text strings, we design a text-based multi-level matching process to link NSF awards to SciSciNet scientific publications:
For records with DOI information in the raw texts of funded research publications, we perform an exact match with SciSciNet primary papers through DOI. If the DOI in an NSF publication record matches that of one primary paper, we create a linkage between the NSF Award Number and the primary Paper ID. We match 458,463 records from NSF awards to SciSciNet primary papers, where each DOI appears only once in the entire primary paper table, thus enabling association with a unique Paper ID (exact match). After dropping duplicates where the same DOI appears repeatedly in the same NSF award, we obtain 350,611 records (26.0%) linking NSF awards to SciSciNet primary papers.
To process the rest of the records, we then use the title information of each article for further matching. After extracting the title from NSF records and performing a standardization procedure (e.g., converting each letter into lowercase and removing punctuation marks, extra spaces, tabs, and newline characters), our exact matches between paper titles in the NSF award data and SciSciNet primary paper data yield 246,701 unique matches (18.3% in total) in this step.
We further develop a search engine for records that have not been matched in the preceding steps. Here we use Elasticsearch, a free and open search and analytics engine, to index detailed information (paper title, author, journal or conference name, and publication year) of all SciSciNet primary papers. We then feed the raw texts of the crawled NSF publications into the system and, for each, retrieve the two indexed primary papers with the highest scores. Similar to a previous study 55 , we use the scores of the second-ranked primary papers as a null model, and accept the first-ranked primary paper as a match if its score is significantly higher than the right-tail cutoff of the second-score distribution ( P = 0.05). Following this procedure, we match a further 467,159 records (34.6%) that remained unmatched after the two previous steps (Fig. 2a ). Note that this procedure likely represents a conservative strategy that prioritizes precision over recall. Manually inspecting the remaining potential matches, we find that those with large differences between the top two Z-scores (Fig. 2b ) are also likely to be correct matches. To this end, we also include these heuristic links, together with the difference of their Z-scores, as fuzzy matching linkages between SciSciNet papers and NSF awards.
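The three matching levels can be sketched as follows. This is an illustrative outline only: the lookup dictionaries `doi_index` and `title_index` are hypothetical stand-ins for the primary paper table, the search-engine scores are assumed to be supplied by the caller (no Elasticsearch call is made here), and the cutoff is taken as a simple empirical quantile, which may differ from the exact procedure used for the data lake.

```python
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_title(title):
    """Lowercase, remove punctuation, and collapse spaces, tabs, and newlines."""
    return re.sub(r"\s+", " ", title.lower().translate(_PUNCT)).strip()


def match_record(doi, title, doi_index, title_index):
    """Levels 1 and 2: exact DOI match, then exact normalized-title match.

    doi_index maps DOI -> Paper ID; title_index maps normalized title ->
    Paper ID (both hypothetical, each key assumed unique).
    """
    if doi and doi in doi_index:
        return doi_index[doi], "doi"
    key = normalize_title(title)
    if key in title_index:
        return title_index[key], "title"
    return None, "unmatched"


def right_tail_cutoff(second_scores, p=0.05):
    """Empirical (1 - p) quantile of the second-best scores (the null model)."""
    ordered = sorted(second_scores)
    idx = min(len(ordered) - 1, int((1 - p) * len(ordered)))
    return ordered[idx]


def accept_top_hit(first_score, cutoff):
    """Level 3: accept the top hit only if it clears the null-model cutoff."""
    return first_score > cutoff
```

Records that fail the cutoff would then be candidates for the fuzzy linkages, ranked by the gap between their top two scores.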
Matching NSF reference string to MAG records. ( a ) Distribution of Z-scores for papers matched in ElasticSearch with the first and second highest scores. The vertical red line denotes the right-tail cutoff of the second score distribution ( P = 0.05). ( b ) Distribution of pairwise Z-score differences for papers matched in search engine but with the first score no higher than the right-tail cutoff of the second score distribution ( P = 0.05).
We further supplement these matchings with information from Crossref data dump, an independent dataset that links publications to over 30 K funders including NSF. We collect all paper-grant pairs where the funder is identified as NSF. We then use the raw grant number from Crossref and link paper records between Crossref and SciSciNet using DOIs. We obtain 305,314 records after cleaning, including 196,509 SciSciNet primary papers with DOIs matching to 83,162 NSF awards.
By combining records collected from all these steps, we collect 1,130,641 unique linkages with high confidence levels and 178,877 additional possible linkages from fuzzy matches (Table 12 ). Together these links connect 148,148 NSF awards and 929,258 SciSciNet primary papers.
Patent citations to science
The process in which knowledge transfers from science to marketplace applications has received much attention in science and innovation literature 35 , 41 , 85 , 86 , 87 , 88 . The United States Patent and Trademark Office (USPTO) makes patenting activity data publicly accessible, with the PatentsView platform providing extensive metadata, including information on patent assignees, inventors, and lawyers, along with patents’ internal citations and full-text information. The European Patent Office (EPO) also provides open access to patent data containing rich attributes.
Building on recent advances in linking papers to patents 35 , 67 , 68 , Marx and Fuegi developed a large-scale dataset of over 40 M citations from USPTO and EPO patents to scientific publications in MAG. Using this corpus (Version v34 as of December 24, 2021), we merge 392 K patent citations received by affiliated MAG papers into their respective primary IDs in the same paper family. Dropping possible duplicate records with the same pair of primary Paper ID and Patent ID results in 38,740,313 paper-patent citation pairs between 2,360,587 patents from USPTO and EPO and 4,627,035 primary papers in SciSciNet (Table 15 ).
Clinical trials citations to science
Understanding bench-to-bed-side translation is essential for biomedical research 81 , 89 . ClinicalTrials.gov provides publicly available clinical study records covering 50 U.S. states and 220 countries, sourced from the U.S. National Library of Medicine. The Clinical Trials Transformation Initiative (CTTI) makes available clinical trials data through a database for Aggregate Analysis of ClinicalTrials.gov (AACT), an aggregated relational database helping researchers better study drugs, policies, publications, and other related items to clinical trials.
Overall, the data covers 686,524 records linking clinical trials to background or result papers (as of January 26th, 2022). We select 480,893 records with papers as reference background supporting clinical trials, of which 451,357 records contain 63,281 unique trials matching to 345,797 reference papers with PMIDs. Similar to the process of linking scientific publications to NIH-funded projects, we again establish linkages between SciSciNet primary Paper ID and NCT Number (National Clinical Trial Number) via PMID, aided by the curated “PaperExtendedAttributes” table as the intermediary. After standardizing the data format of the intermediate index PMID to merge publications and clinical trials, we obtain 438,220 paper-clinical linkages between 61,447 NCT clinical trials and 337,430 SciSciNet primary papers (Table 13 ).
News and social mentions of science
Understanding how science is mentioned in media has been another important research direction in the science of science community 44 , 90 . The Newsfeed mentions in Crossref Event Data link scientific papers in Crossref 59 with DOIs to news articles or blog posts in RSS and Atom feeds, providing access to the latest scientific news mentions from multiple sources, including Scientific American , The Guardian , Vox , The New York Times , and others. Also, Twitter mentions in Crossref Event Data link scientific papers to tweets created by Twitter users, offering an opportunity to explore scientific mentions in Twitter.
We use the Crossref Event API to collect 947,160 records between 325,396 scientific publications and 387,578 webpages from news blogs or posts (from April 5th, 2017 to January 16th, 2022) and 59,593,281 records between 4,661,465 scientific publications and 58,099,519 tweets (from February 7th, 2017 to January 17th, 2022).
For both news media and social media mentions, we further link Crossref’s publication records to SciSciNet’s primary papers. To do so, we first normalize the DOI format of these data records and convert all alphabetic characters to lowercase. We use the normalized DOI as the intermediate index, as detailed below:
For news media mentions, we construct linkages between primary Paper ID and Newsfeed Object ID (i.e., the webpage of news articles or blog posts) by inner joining normalized DOIs. We successfully link 899,323 records from scientific publications to news webpages in the Newsfeed list, accounting for 94.9% of the total records. The same news mention may be collected multiple times. After removing duplicate records, we end up with 595,241 records, linking 307,959 papers to 370,065 webpages from Newsfeed (Table 17 ).
Similarly, for social media mentions, we connect primary Paper IDs with Tweet IDs through inner joining normalized DOIs, yielding 56,121,135 records, more than 94% of the total records. After dropping duplicate records, we keep 55,846,550 records, linking 4,329,443 papers to 53,053,505 tweets (Table 16 ).
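The DOI-based join described above might look like the following sketch. The lowercase conversion follows the text; stripping resolver prefixes such as `https://doi.org/` is an assumption about what normalizing the DOI format involves, and the input shapes (`events` as `(doi, object_id)` pairs, `papers` as a DOI-to-Paper-ID dictionary) are hypothetical.

```python
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver prefix (assumed step)."""
    doi = doi.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def link_mentions(events, papers):
    """Inner-join mention events to primary papers on the normalized DOI.

    events: iterable of (doi, object_id), e.g. a Newsfeed page or Tweet ID.
    papers: dict mapping DOI -> primary Paper ID.
    Returns sorted unique (paper_id, object_id) links.
    """
    doi_to_paper = {normalize_doi(d): pid for d, pid in papers.items()}
    links = set()
    for doi, object_id in events:
        paper_id = doi_to_paper.get(normalize_doi(doi))
        if paper_id is not None:  # inner join: unmatched events are dropped
            links.add((paper_id, object_id))  # the set removes repeated mentions
    return sorted(links)
```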
We also provide metadata of paper-news linkages, including the mention time and the detailed mention information in Newsfeed, to better support future research on this topic (Table 18 ). Similarly, we also offer the metadata of paper-tweet links, including the mention time and the original collected Tweet ID so that interested researchers can merge with further information from Twitter using the Tweet ID (Table 19 ).
Nobel Prize data from the dataset of publication records for Nobel laureates
We integrate a recent dataset by Li et al . 91 in the data lake, containing the publication records of Nobel laureates in science from 1900 to 2016, including both Nobel prize-winning works and other papers produced in their careers. After mapping affiliated MAG Paper IDs to primary ones, we obtain 87,316 publication records of Nobel laureates in SciSciNet primary paper Table (20,434 in physics, 38,133 in chemistry, and 28,749 in physiology/medicine, Table 14 ).
Calculation of commonly used measurements
Using the constructed dataset, we further calculate a range of commonly used measurements of scientific ideas, impacts, careers, and collaborations. Interested readers can find more details and validations of these measurements in the literature 15 , 19 , 20 , 46 , 47 , 48 , 92 , 93 , 94 , 95 , 96 , 97 , 98 .
Publication-level
The number of researchers and institutions in a scientific paper.
Building on team science literature 19 , 27 , we calculate the number of authors and the number of institutions for each paper as recorded in our data lake. We group papers by primary Paper ID in the selected “SciSciNet_PaperAuthorAffiliations” table and aggregate the unique counts of Author IDs and Affiliation IDs as the number of researchers (team size) and institutions, respectively.
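A minimal sketch of this aggregation, assuming the table is available as rows of `(paper_id, author_id, affiliation_id)` with `None` for a missing affiliation (a hypothetical layout):

```python
from collections import defaultdict


def team_and_institution_counts(rows):
    """Return {paper_id: (number of unique authors, number of unique institutions)}."""
    authors = defaultdict(set)
    affiliations = defaultdict(set)
    for paper_id, author_id, affiliation_id in rows:
        authors[paper_id].add(author_id)
        if affiliation_id is not None:
            affiliations[paper_id].add(affiliation_id)
    return {
        pid: (len(authors[pid]), len(affiliations[pid]))
        for pid in authors
    }
```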
Five-year citations (\(c_5\)), ten-year citations (\(c_{10}\)), normalized citation (\(c_f\)), and hit paper
The number of citations of a paper evolves over time 46 , 48 , 99 , 100 . Here we calculate \(c_5\) and \(c_{10}\), defined as the number of citations a paper received within 5 years and 10 years of publication, respectively. For the primary papers, we calculate \(c_5\) for all papers published up to 2016 (as the last version of the MAG publication data extends to 2021) by counting the number of citation pairs with a time difference less than or equal to 5 years. Similarly, we calculate \(c_{10}\) for all papers published up to 2011.
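The windowed count can be sketched as below, assuming citation pairs of the form `(citing_id, cited_id)` and a hypothetical `pub_year` dictionary. The `last_year` parameter encodes the 2021 data horizon, so that only papers with a complete window receive a value.

```python
from collections import Counter


def citations_within(citation_pairs, pub_year, window=5, last_year=2021):
    """Count citations received within `window` years of publication.

    Only cited papers published no later than last_year - window are counted,
    so every counted paper has a full observation window. Papers absent from
    the result either received no citations in the window or were published
    too recently to have a complete window.
    """
    counts = Counter()
    for citing, cited in citation_pairs:
        if citing not in pub_year or cited not in pub_year:
            continue
        if pub_year[cited] > last_year - window:
            continue
        # Time difference less than or equal to the window length.
        if pub_year[citing] - pub_year[cited] <= window:
            counts[cited] += 1
    return counts
```

Calling the function with `window=10` gives the ten-year count under the same assumptions.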
To compare citation counts across disciplines and time, Radicchi et al. 48 proposed the relative citation indicator \(c_f\), defined as the total number of citations \(c\) divided by the average number of citations \(c_0\) in the same field and the same year. Here we calculate the normalized citation indicator for each categorized paper in both top-level fields and subfields, known as Level-0 fields (19 in total) and Level-1 fields (292 in total) categorized by MAG, respectively. Note that each paper may be associated with multiple fields, hence here we report calculated normalized citations for each paper-field pair in the “SciSciNet_PaperFields” data table.
Another citation-based measure widely used in the science of science literature 16 , 19 , 83 is “hit papers”, defined as papers in the top 5% of citations within the same field and year. Similar to our calculation of \(c_f\), we use the same grouping by fields and years, and identify all papers with citations greater than the top 5% citation threshold. We also perform similar operations for the top 1% and top 10% hit papers.
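Both measures share the same grouping by field and year, as the sketch below illustrates. The row layout `(paper_id, field_id, year, citations)` is hypothetical, and the way the top-fraction threshold is chosen (the citation count just outside the top fraction of the ranked group) is one possible convention rather than the documented one.

```python
from collections import defaultdict


def _group_by_field_year(rows):
    groups = defaultdict(list)
    for paper_id, field_id, year, citations in rows:
        groups[(field_id, year)].append((paper_id, citations))
    return groups


def normalized_citations(rows):
    """Return {(paper_id, field_id): c / c0}, with c0 the field-year mean."""
    result = {}
    for (field_id, _year), members in _group_by_field_year(rows).items():
        c0 = sum(c for _, c in members) / len(members)
        for paper_id, c in members:
            # Undefined when no paper in the group has been cited.
            result[(paper_id, field_id)] = c / c0 if c0 > 0 else None
    return result


def hit_papers(rows, top=0.05):
    """Return the set of (paper_id, field_id) pairs above the top-fraction threshold."""
    hits = set()
    for (field_id, _year), members in _group_by_field_year(rows).items():
        ordered = sorted((c for _, c in members), reverse=True)
        # Citation count of the first paper outside the top fraction;
        # in very small groups no paper exceeds it, so none is flagged.
        threshold = ordered[int(top * len(ordered))]
        hits.update((pid, field_id) for pid, c in members if c > threshold)
    return hits
```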
Citation dynamics
A model developed by Wang, Song, and Barabási (the WSB model) 46 captures the long-term citation dynamics of individual papers after incorporating three fundamental mechanisms, including preferential attachment, aging, and fitness. The model predicts the cumulative citations received by paper \(i\) at time \(t\) after publication: \(c_i^t = m\left[e^{\lambda_i \Phi\left(\frac{\ln t - \mu_i}{\sigma_i}\right)} - 1\right]\), where \(\Phi(x)\) is the standard cumulative normal distribution of \(x\), \(m\) captures the average number of references per paper, and \(\mu_i\), \(\sigma_i\), and \(\lambda_i\) indicate the immediacy, longevity, and fitness parameters characterizing paper \(i\), respectively.
We implement the WSB model with prior for papers published in the fields of math and physics. Following the method proposed by Shen et al. 92 , we adopt the Bayesian approach to calculate the conjugate prior, which follows a gamma distribution. The method allows us to better predict the long-term impact through the posterior estimation of \(\lambda_i\), while helping to avoid potential overfitting problems. Fitting this model to empirical data, we compute the immediacy \(\mu_i\), the longevity \(\sigma_i\), and the ultimate impact \(c_i^{\infty} = m\left[e^{\lambda_i} - 1\right]\) for all math and physics papers with at least 10 citations within 10 years after publication (published no later than 2011). To facilitate research on citation dynamics across different fields 48 , we have also used the same procedure to fit the citation sequences for papers that have received at least 10 citations within 10 years across all fields of study from the 1960s to the 1990s.
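For readers who want to evaluate the model once the parameters are known, the two formulas can be written directly in Python. This sketch only evaluates the closed-form expressions; it does not perform the Bayesian fitting described above, and the default `m = 30.0` is a placeholder assumption rather than a value stated in this text.

```python
import math


def std_normal_cdf(x):
    """Standard normal cumulative distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wsb_cumulative_citations(t, lam, mu, sigma, m=30.0):
    """Cumulative citations at time t > 0 after publication.

    lam, mu, and sigma are the fitness, immediacy, and longevity parameters;
    m is the average number of references per paper (placeholder default).
    """
    phi = std_normal_cdf((math.log(t) - mu) / sigma)
    return m * (math.exp(lam * phi) - 1.0)


def wsb_ultimate_impact(lam, m=30.0):
    """Limit of the cumulative citations as t grows without bound."""
    return m * (math.exp(lam) - 1.0)
```

Because \(\Phi\) tends to 1 for large \(t\), the cumulative curve saturates at the ultimate impact, which depends only on the fitness parameter and \(m\).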
Sleeping beauty coefficient
Sometimes it may take years or even decades for papers to gain attention from the scientific community, a phenomenon known as the “Sleeping Beauty” in science 93 . The sleeping beauty coefficient \(B\) is defined as \(B = \sum_{t=0}^{t_m} \frac{\frac{c_{t_m} - c_0}{t_m} \cdot t + c_0 - c_t}{\max\left(1, c_t\right)}\), where \(c_t\) is the number of citations the paper receives in year \(t\) after publication, \(c_{t_m}\) is its maximum yearly citation count, reached in year \(t_m\), and \(c_0\) is the count in the year of publication. Here we calculate the sleeping beauty coefficient from yearly citation records of a paper. We match the publication years for each citing-cited paper pair published in journals and then aggregate yearly citations since publication for each cited paper. Next, we group the “SciSciNet_PaperReferences” table by each cited paper and compute the coefficient \(B\), along with the awakening time. As a result, we obtain 52,699,363 records with sleeping beauty coefficients for journal articles with at least one citation.
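The formula translates into a short function. In this sketch, `yearly_citations[t]` is assumed to hold the citations received in year `t` after publication; returning 0 when the peak falls in the publication year, and taking the first peak when several years tie, are conventions assumed here rather than taken from the text. The awakening time is not computed.

```python
def sleeping_beauty_coefficient(yearly_citations):
    """Compute B from a list of yearly citation counts since publication."""
    if not yearly_citations:
        return 0.0
    # Year of maximum yearly citations (first occurrence on ties).
    t_m = max(range(len(yearly_citations)), key=lambda t: yearly_citations[t])
    if t_m == 0:
        return 0.0  # assumed convention: no delay before the peak
    c_0 = yearly_citations[0]
    c_tm = yearly_citations[t_m]
    slope = (c_tm - c_0) / t_m
    # Sum the normalized gap between the reference line and the actual curve.
    return sum(
        (slope * t + c_0 - yearly_citations[t]) / max(1, yearly_citations[t])
        for t in range(t_m + 1)
    )
```

A paper whose citations grow linearly up to the peak has B equal to 0, while a long flat period followed by a late burst produces a large positive B.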
Novelty and conventionality
Research shows that the highest-impact papers in science tend to be grounded in exceptionally conventional combinations of prior work yet simultaneously feature an intrusion of atypical combinations 47 . Here following this work 47 , we calculate the novelty and conventionality score of each paper by computing the Z-score for each combination of journal pairs. We further calculate the distribution of journal pair Z-scores by traversing all possible duos of references cited by a particular paper. A paper’s median Z-score characterizes the median conventionality of the paper, whereas a paper’s 10th percentile Z-score captures the tail novelty of the paper’s atypical combinations.
More specifically, we first use the information of publication years for each citing-cited paper pair both published in journals and shuffle the reference records within the citing-cited year group to generate 10 randomized citation networks, while controlling the naturally skewed citation distributions. We then traverse each focal paper published in the same year. We further aggregate the frequency of reference journal pairs for papers in the real citation network and 10 randomized citation networks, calculating the Z-score of each reference journal pair for papers published in the same year. Finally, for each focal paper, we obtain the 10th percentile and median of its Z-score distribution, yielding 44,143,650 publication records with novelty and conventionality measures for journal papers from 1950 to 2021.
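The last two steps, the Z-score of a journal pair and the per-paper summary, can be sketched as follows. The network shuffling itself is not shown. The use of the population standard deviation and of a nearest-rank 10th percentile are assumptions made for this sketch; other conventions would give slightly different values.

```python
from statistics import mean, median, pstdev


def pair_z_score(observed, randomized_counts):
    """Z-score of a journal pair's observed co-citation frequency.

    randomized_counts holds the pair's frequency in each randomized network.
    """
    spread = pstdev(randomized_counts)
    if spread == 0:
        return None  # undefined when the randomized counts do not vary
    return (observed - mean(randomized_counts)) / spread


def paper_novelty_conventionality(z_scores):
    """Return (10th percentile, median) of a paper's journal-pair Z-scores."""
    ordered = sorted(z_scores)
    # Nearest-rank percentile: ceil(n / 10), computed with integers.
    rank = max(1, (len(ordered) + 9) // 10)
    return ordered[rank - 1], median(ordered)
```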
Disruption score
The disruption index quantifies the extent to which a paper disrupts or develops the existing literature 20 , 51 . Disruption, or \(D\), is calculated through citation networks. For a given paper, one can separate its future citations into two types. One type cites only the focal paper itself while ignoring all the references that the paper builds upon, and the other cites both the focal paper and its references. \(D\) is expressed as: \(D = p_i - p_j = \frac{n_i - n_j}{n_i + n_j + n_k}\), where \(n_i\) is the number of subsequent works that only cite the focal paper, \(n_j\) is the number of subsequent works that cite both the focal paper and its references, and \(n_k\) is the number of subsequent works that cite the references of the focal paper only. Following this definition, we calculate the disruption scores for all the papers that have at least one forward and backward citation (48,581,274 in total).
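The definition maps onto a simple classification of subsequent papers. The sketch below assumes the citation network is available as a hypothetical dictionary `later_papers` that maps each subsequent paper to the set of papers it references.

```python
def disruption_index(focal_id, focal_refs, later_papers):
    """Compute D for one focal paper.

    focal_refs: set of paper IDs referenced by the focal paper.
    later_papers: dict mapping each subsequent paper ID to its reference set.
    Returns None when no subsequent paper cites the focal paper or its references.
    """
    n_i = n_j = n_k = 0
    for refs in later_papers.values():
        cites_focal = focal_id in refs
        cites_refs = bool(focal_refs & refs)
        if cites_focal and not cites_refs:
            n_i += 1  # cites only the focal paper
        elif cites_focal and cites_refs:
            n_j += 1  # cites the focal paper and its references
        elif cites_refs:
            n_k += 1  # cites only the focal paper's references
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None
```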
The number of NSF and NIH supporting grants
For external linkages from scientific publications to upstream supporting funding sources, we calculate the number of NSF/NIH grants associated with each primary paper in SciSciNet.
The number of patent citations, Newsfeed mentions, Twitter mentions, and clinical trial citations
For external linkages from scientific publications to downstream public uses of science, we also calculate the number of citations each primary paper in SciSciNet received from domains that go beyond science, including patents from USPTO and EPO, news and social media mentions from Newsfeed and Twitter, and clinical trials from ClinicalTrials.gov.
Individual- and Institutional-level measures
Productivity.
Scientific productivity is a widely used measure for quantifying individual careers 9 , 15 . Here we aggregate the unique primary Paper ID in SciSciNet, after grouping the records in the “SciSciNet_PaperAuthorAffiliations” data table by Author ID or Affiliation ID and calculate the number of publications produced by the same author or affiliation.
The H-index is a popular metric to estimate a researcher’s career impact. The index of a scientist is h if h of her papers have at least h citations each and each of the remaining papers has no more than h citations 94 , 101 . Here we compile the full publication list associated with each author, sort these papers by their total number of citations in descending order, and calculate the maximum value that satisfies the condition above as the H-index. By repeating the same procedure for each research institution, we also provide an institution-level H-index.
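The sort-and-scan procedure described above can be written as a short function, given the list of citation counts for one author or institution:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no larger h exists
    return h
```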
Scientific impact
Building on our \(c_{10}\) measure at the paper level, here we further calculate the average \(c_{10}\) (\(\langle c_{10} \rangle\)) for each author and affiliation, which offers a proxy to individual and institutional level scientific impact. Similarly, we calculate the average \(\log c_{10}\) (\(\langle \log c_{10} \rangle\)), which is closely related to the \(Q\) parameter 15 of individual scientific impact.
Here we group by Author ID and Affiliation ID in the “PaperAuthorAffiliations” table, and then aggregate \(c_{10}\) and \(\log c_{10}\) (pre-calculated at the paper level) of all papers published by the same ID. Following previous works 15 , 16 , 102 , to avoid taking the logarithm of zero, we increase \(c_{10}\) by one when calculating \(\langle \log c_{10} \rangle\).
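A minimal sketch of this aggregation for one entity type, assuming rows of `(paper_id, entity_id)` and a hypothetical dictionary `c10` of pre-computed paper-level values. The natural logarithm is used here as an assumption, since the base is not specified above.

```python
import math
from collections import defaultdict


def average_impact(rows, c10):
    """Return {entity_id: (mean c10, mean log(c10 + 1))}.

    rows: iterable of (paper_id, entity_id), where the entity is an author
          or an affiliation. Papers without a c10 value are skipped.
    """
    papers = defaultdict(set)
    for paper_id, entity_id in rows:
        if paper_id in c10:
            papers[entity_id].add(paper_id)  # count each paper once per entity
    result = {}
    for entity_id, paper_ids in papers.items():
        values = [c10[p] for p in paper_ids]
        mean_c10 = sum(values) / len(values)
        # Add one before taking the logarithm to avoid log(0).
        mean_log = sum(math.log(v + 1) for v in values) / len(values)
        result[entity_id] = (mean_c10, mean_log)
    return result
```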
Name-gender associations
The availability of big data also enables a range of studies focusing on gender disparities, ranging from scientific publications and careers 17 , 103 , 104 , 105 , 106 to collaboration patterns 25 , 107 and the effects of the pandemic on women scientists 45 , 108 , 109 , 110 . Here we apply the method from a recent statistical model 80 to infer author gender based on their first names in the original author table. The method feeds unique author names into a cultural consensus model of name-gender associations incorporating 36 separate sources across over 150 countries. Note that of the 134,197,162 authors, 23.26% (31,224,458) have only first initials and are excluded from the inference. By fine-tuning the annotated names from these data sources following the original method, we obtain 409,809 unique names with the max uncertainty threshold set to 0.26 and 85% of the sample classified. Finally, we merge these name-gender inference records into the original SciSciNet_Authors table, resulting in a SciSciNet_Authors_Gender table, which contains 86,286,037 authors with the inferred probability that a name belongs to an individual gendered female, denoted as P(gf), as well as the number of inference source datasets and empirical counts. Together, by combining new statistical models with our systematic authorship information, this new table provides name-gender information useful in studying gender-related questions. It is important to note that such name-based gender inference algorithms, including the one used here as well as other popular tools such as genderize.io , have limitations and are necessarily imperfect. The limitations should be considered carefully when applying these methods 96 .
Data Records
The data lake, SciSciNet, is freely available at Figshare 72 .
Data structure
Table 2 presents the size and descriptions of these data files.
Table 3 contains information about “SciSciNet_Papers”, which is the data lake’s primary paper table, containing information on the primary scientific publications, including Paper ID, DOI, and others, along with the Journal ID or Conference Series ID, which can link papers to corresponding journals or conference series that take place regularly. The short description in each data field includes the corresponding explanation of that field.
Tables 4 – 22 include the data fields and corresponding descriptions of each data table. The meaning of each data field is clear from its index name. An ID field in one table can be linked to a field in another table when the two share the same ID name. Further, the data link tables provide linkages from scientific publications to external socioeconomic institutions. For example, the paper with primary “PaperID” “246319838”, which studied hereditary spastic paraplegia 111 , is linked to three core NIH project numbers, “R01NS033645”, “R01NS036177”, and “R01NS038713”, in Table 11 “SciSciNet_Link_NIH”. We can not only extract detailed information and metrics of the paper from the data lake (e.g., the title from Table 8 “SciSciNet_PaperDetails”, or citation counts from the primary paper Table 3 “SciSciNet_Papers”) but also obtain further information on the funded projects, such as the total funding amount, from NIH RePORTER ( https://report.nih.gov ).
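The lookup in the example above can be sketched as follows. The rows stand in for a small slice of “SciSciNet_Link_NIH”; the column name `NIH_Project_Number` and the helper `index_by_paper` are assumptions made for illustration, so the actual field names should be checked against Table 11.

```python
from collections import defaultdict

# Stand-in for rows of the link table; "NIH_Project_Number" is an assumed column name.
link_nih = [
    {"PaperID": "246319838", "NIH_Project_Number": "R01NS033645"},
    {"PaperID": "246319838", "NIH_Project_Number": "R01NS036177"},
    {"PaperID": "246319838", "NIH_Project_Number": "R01NS038713"},
]


def index_by_paper(link_rows, value_field):
    """Map each PaperID to the list of linked values, preserving row order."""
    index = defaultdict(list)
    for row in link_rows:
        index[row["PaperID"]].append(row[value_field])
    return index


projects = index_by_paper(link_nih, "NIH_Project_Number")["246319838"]
```

Because every table shares the “PaperID” key, the same index can be built for any other link table and then joined to the paper tables.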
Descriptive statistics
Next, we present a set of descriptive statistics derived from the data lake. Figure 3a–c show the distribution of papers across 19 top-level fields, the exponential growth of scientific publications in SciSciNet over time, and the average team size of papers by field over time.
Summary statistics of scientific publications in SciSciNet. ( a ) The number of publications in 19 top-level fields. For clarity we aggregated the field classification into the top level (e.g., a paper is counted as a physics paper if it is associated with physics or any other subfields of physics). ( b ) The exponential growth of science over time. ( c ) Average team size by field from 1950 to 2020. The bold black line is for papers in all the 19 top-level fields. Each colored line indicates each of the 19 fields (color coded according to (a)).
Building on the external linkages we constructed, Fig. 4a–f show the distribution of paper-level upstream funding sources from NIH and NSF, and downstream applications and mentions of science, including USPTO/EPO patents, clinical trials, news mentions from Newsfeed, and social media mentions from Twitter.
Linking scientific publications with socioeconomic institutions. Panels ( a, b and d, e ) show the distribution of paper-level downstream applications ( a : Twitter mentions; b : Newsfeed mentions; d : Patents; e : Clinical trials). Panels ( c and f ) show the distribution of supporting scientific grants from NIH ( c ) and NSF ( f ).
Figure 5 presents the probability distributions of various commonly used metrics in the science of science using our data lake, which are broadly consistent with the original studies in the literature.
Commonly used metrics in SciSciNet. ( a ) The distribution of disruption score for 48,581,274 papers 20 (50,000 bins in total). ( b ) Cumulative distribution function (CDF) of 44,143,650 journal papers’ 10th percentile and median Z-scores 47 . ( c ) Distribution of \(e^{\langle \log c_{10} \rangle}\) for scholars 15 with at least 10 publications in SciSciNet. The red line corresponds to a log-normal fit with μ = 2.14 and σ = 1.14. ( d ) Survival distribution function of sleeping beauty coefficients 93 for 52,699,363 papers, with a power-law fit: exponent α = 2.40. ( e ) Data collapse for a selected subset of papers with more than 30 citations within 30 years across journals in physics in the 1960s, based on WSB model 46 . The red line corresponds to the cumulative distribution function of the standard normal distribution.
Technical Validation
Validation of publication and citation records.
As we select the primary papers from the original MAG dataset, we have re-counted the citations and references within the subset of primary papers. To test the reliability of updated citation and reference counts in SciSciNet, here we compare the two versions (i.e., raw MAG counts and redistributed SciSciNet counts), by calculating the Spearman correlation coefficients for both citations and references. The Spearman correlation coefficients are 0.991 for citations and 0.994 for references, indicating that these metrics are highly correlated before and after the redistribution process.
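As a rough illustration of this comparison, the sketch below computes a Spearman coefficient as the Pearson correlation of average ranks, which handles ties. The two short count lists are made-up toy values, and the function names are hypothetical; this is not the validation script used for the reported figures.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy citation counts before and after re-counting (illustrative only).
raw_counts = [0, 3, 3, 10, 25]
recounted = [0, 2, 3, 8, 25]
rho = spearman(raw_counts, recounted)
```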
We also examine the coverage of our publication data through a cross-validation with an external dataset, Dimensions 112 . By using DOI as a standardized identifier, we find that the two databases contain a similar number of papers, with 106,517,016 papers in Dimensions and 98,795,857 papers in SciSciNet associated with unique DOIs. We further compare the overlap of the two databases, finding the two data sources share a vast majority of papers in common (84,936,278 papers with common DOIs, accounting for 79.74% of Dimensions and 85.97% of SciSciNet).
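The overlap shares follow directly from the three counts quoted above. The sketch below reproduces them and shows one possible way to compute the same quantities from two DOI lists. Lower-casing and trimming DOIs before matching is an assumption about preprocessing (DOIs are case-insensitive), and the function names are hypothetical.

```python
def normalize_doi(doi):
    """Trim whitespace and lower-case a DOI so equivalent strings compare equal."""
    return doi.strip().lower()


def doi_coverage(dois_a, dois_b):
    """Return (common count, share of A found in B, share of B found in A)."""
    a = {normalize_doi(d) for d in dois_a}
    b = {normalize_doi(d) for d in dois_b}
    common = len(a & b)
    return common, common / len(a), common / len(b)


# Counts reported in the text: common DOIs, Dimensions DOIs, SciSciNet DOIs.
common, n_dimensions, n_sciscinet = 84_936_278, 106_517_016, 98_795_857
share_dimensions = round(100 * common / n_dimensions, 2)  # percent of Dimensions
share_sciscinet = round(100 * common / n_sciscinet, 2)    # percent of SciSciNet
```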
Further, the citation information recorded by the two datasets appears highly consistent. Within the 84.9 M papers we matched with common DOIs, SciSciNet records a similar, yet slightly higher number of citations on average (16.75), compared with Dimensions (14.64). Our comparison also reveals a high degree of consistency in paper-level citation counts between the two independent corpora, with a Spearman correlation coefficient 0.946 and a concordance coefficient 98 , 113 of 0.940. Together, these validations provide further support for the coverage of the data lake.
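For readers unfamiliar with the concordance coefficient, here is a minimal sketch assuming it refers to Lin's concordance correlation coefficient, which penalizes both weak correlation and systematic offsets between two sets of counts. The function name is hypothetical, and population (1/n) moments are used.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


# A constant offset keeps the Pearson correlation at 1 but lowers concordance.
shifted = concordance_ccc([1, 2, 3], [2, 3, 4])
```

Unlike a rank correlation, this quantity drops when one source records systematically more citations than the other, which is why the two measures are reported together.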
Validation of external data linkages
We further perform additional cross-validation to understand the reliability of data linkages from scientific publications to external data sources. Here we focus more on the NSF-SciSciNet publications linkages we created from raw data collection to final data linkage. We also use the same approach to validate the NIH-SciSciNet publications linkages.
Here we compare the distribution and coverage of paper-grants linkages between SciSciNet and Dimensions—one of the state-of-the-art commercial databases in publication-grant linkages 112 . Figure 6a,b present the distribution of the number of papers matched to each NSF award and NIH grant, showing that our open-source approach offers a comparable degree of coverage. We further perform individual grant level analysis, by comparing the number of papers matched to each grant reported by the two sources (Fig. 6c,d ), again finding high degrees of consistency (Spearman correlation coefficient: 0.973 for NIH grants and 0.714 for NSF grants).
Validation of data linkages between SciSciNet and Dimensions. Panels ( a, b ), The distribution of number of papers matched to each NIH and NSF grant, respectively. Panels ( c, d ), The number of papers matched to each NIH and NSF grant, respectively. All panels are based on data in a 20-year period (2000–2020).
We further calculate the confusion matrices of linkage from SciSciNet and Dimensions. By connecting the two datasets through paper DOIs and NSF/NIH grant project numbers, we compare their overlaps and differences in grant-paper pairs. For NSF, the confusion matrix is shown in Table 23 . The two datasets provide a similar level of coverage, with Dimensions containing 670,770 pairs and SciSciNet containing 632,568 pairs. 78.9% pairs in Dimensions (and 83.7% pairs in SciSciNet) can be found in the other dataset, documenting a high degree of consistency between the two sources. While there are data links contained in Dimensions that are not in SciSciNet, we also find that there exists a similar amount of data records in SciSciNet but not in Dimensions. Table 24 shows the confusion matrix of NIH grant-paper pairs between the two datasets. Again, the two datasets share a vast majority of grant-paper pairs in common, and 95.3% pairs in Dimensions (and 99.7% pairs in SciSciNet) can also be found in the other dataset. These validations further support the overall quality and coverage of data linkages in SciSciNet.
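The pair-level comparison can be sketched as set operations over (grant, DOI) tuples. The identifiers below are invented placeholders, and `pair_confusion` is a hypothetical helper, not part of the released code.

```python
def pair_confusion(pairs_a, pairs_b):
    """Count grant-paper pairs found in both sources, only in A, and only in B."""
    a, b = set(pairs_a), set(pairs_b)
    return {"both": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Placeholder (grant number, DOI) pairs for two sources; not real records.
source_a = [("G1", "10.1/x"), ("G1", "10.1/y"), ("G2", "10.1/z")]
source_b = [("G1", "10.1/x"), ("G2", "10.1/z"), ("G3", "10.1/w")]
cells = pair_confusion(source_a, source_b)
```

Dividing the `both` count by the size of each source gives the two overlap percentages reported for each funder.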
Validation of calculations of commonly used measurements
We also validate the calculated metrics included in SciSciNet. Beyond manually inspecting independent data samples during data processing and presenting the distributions of the indicators in the Descriptive statistics section, which capture general patterns, we double-check these popular measurements by reproducing canonical results in the science of science through a series of standardized and transparent processes.
For disruption scores, we plot the median disruption percentile and average citations against team size for 48,581,274 publications with at least one citation and one reference record in SciSciNet. As shown in Fig. 7a , when team size increases, the disruption percentile decreases while the average citations increase, which is consistent with the empirical finding that small teams disrupt whereas large teams develop 20 . In addition, the probability of being among the top 5% most disruptive publications is negatively correlated with team size, while the probability of being among the most impactful publications is positively correlated with team size (Fig. 7b ). These results are consistent with those obtained in the literature.
Calculating commonly used measurements in the science of science literature. ( a, b ), Small teams disrupt while large teams develop in SciSciNet. ( c ), The cumulative distribution functions (CDFs) of the proportion of external citations for papers with high (top 10,000, B > 307.55), medium (from the 10,001st to the top 2% of SBs, 33 < B ≤ 307.55), and low (B ≤ 33) sleeping beauty indexes. ( d ), The probability of a 5% hit paper, conditional on novelty and conventionality, for all journal articles in SciSciNet from 1950 to 2000.
Combinations of conventional wisdom and atypical knowledge tend to predict a higher citation impact 47 . Here we repeat the original analysis by categorizing papers based on (1) median conventionality: whether the median score of a paper is in the upper half, and (2) tail novelty: whether the paper is within the top 10th percentile of novelty score. We then identify hit papers (within the subset of our analysis), defined as papers that rank in the top 5% of ten-year citations within the same top-level field and year. The four quadrants in Fig. 7d suggest that papers with high median conventionality and high tail novelty present a higher hit rate of 7.32%, within the selection of SciSciNet papers published from 1950 to 2000. Papers with high median conventionality but low tail novelty show a hit rate of 4.18%, roughly similar to the baseline rate of 5%, while those with low median conventionality but high tail novelty display a hit rate of 6.48%. Meanwhile, papers with both low median conventionality and low tail novelty exhibit a hit rate of 3.55%. These results are broadly consistent with the canonical results reported in ref. 47 .
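The quadrant hit rates can be tabulated along the following lines. The sketch assumes each paper has already been reduced to three boolean flags (high median conventionality, high tail novelty, top-5% hit); how those flags are derived from Z-scores is left to the original method, and the toy records are invented.

```python
from collections import Counter


def hit_rates(papers):
    """papers: iterable of (high_conventionality, high_novelty, is_hit) booleans.

    Returns {(high_conventionality, high_novelty): fraction of hit papers}.
    """
    totals, hits = Counter(), Counter()
    for conv, nov, hit in papers:
        totals[(conv, nov)] += 1
        hits[(conv, nov)] += int(hit)
    return {quadrant: hits[quadrant] / totals[quadrant] for quadrant in totals}


# Invented flags for six papers, used only to exercise the function.
toy_papers = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, True, False),
    (False, False, True),
    (False, False, False),
]
rates = hit_rates(toy_papers)
```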
In Fig. 5e , we select 36,802 physics papers published in the 1960s with more than 30 citations within 30 years of publication. By rescaling their citation dynamics using the fitted parameters, we find a remarkable collapse of rescaled citation dynamics which appears robust across fields and decades. We further validate the predictive power of the model with prior based on Shen et al . 92 , by calculating the out-of-sample prediction accuracy. We find that with a training period of 15 years, the predictive accuracy (defined as a strict absolute tolerance threshold of 0.1) stays above 0.65 for 10 years after the training period, and the Mean Absolute Percentage Error (MAPE) is less than 0.1. The MAPE stays less than 0.15 for 20 years after the training period.
Sleeping beauty
We first fit the distribution of the sleeping beauty coefficients in SciSciNet (Fig. 5d ) to a power-law form using maximum likelihood estimation 114 , obtaining a power-law exponent α = 2.40 and minimum value \(B_m\) = 23.59. By using fine-grained subfield information provided by MAG, we further calculate the proportion of external citations. Consistent with the original study 93 , we find that papers with high B scores are more likely to have a higher proportion of external citations from other fields (Fig. 7c ).
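A maximum likelihood fit of a power-law exponent can be sketched with the standard continuous estimator, α = 1 + n / Σ ln(x_i / x_min), applied to values at or above a chosen minimum. Whether the fit reported here used the continuous or the discrete variant, and how the minimum value was selected, is not stated in this section, so both are assumptions; the function name is hypothetical.

```python
import math


def powerlaw_alpha_mle(values, x_min):
    """Continuous power-law MLE over the tail values >= x_min.

    Returns (alpha_hat, number of tail observations used).
    """
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum, len(tail)


# Synthetic check: five values at e * x_min give alpha = 2; 0.5 lies below x_min.
alpha_hat, n_tail = powerlaw_alpha_mle([0.5] + [math.e] * 5, x_min=1.0)
```

In practice the minimum value is usually chosen by scanning candidates and minimizing a goodness-of-fit distance, a step omitted here for brevity.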
Usage Notes
Note that, recognizing the recent surge of interest in quantitative understanding of science 95 , 97 , 98 , 115 , 116 , the measurements currently covered in the data lake are not meant to be comprehensive; rather they serve as examples to illustrate how researchers from the broader community can collectively contribute to and enrich the data lake. There are also limitations that readers should keep in mind when using the data lake. For example, our grant-publication linkage is focused on scientific papers supported by NSF and NIH; patent-publication linkage is limited to citations from USPTO and EPO patents; clinical trial-publication linkage is derived from ClinicalTrials.gov (where the geographical distribution may be heterogeneous across countries, Table 25 ); and media-publication linkage is based on sources tracked by Crossref. Further, while our data linkages are based on state-of-the-art methods of data extraction and cleaning, as with any matching, the methods are necessarily imperfect and may be further improved through integration with complementary commercial products such as Altmetric and Dimensions. Finally, our data inherently represent a static snapshot, drawing primarily from the final edition of MAG (Dec 2021 version). While this snapshot is already sufficient for answering many of the research questions that arise in the field, future work may engage in continuous improvement and updating of the data lake to maximize its potential.
Overall, this data lake serves as an initial step toward serving the community in studying publications, funding, and broader impact. At the same time, there are several promising directions for future work expanding the present effort. For example, the rapid development of natural language processing (NLP) models and techniques, accompanied by the increasing availability of text information from scientific articles, offers new opportunities to collect and curate more detailed content information. In particular, one can link SciSciNet to other sources such as OpenAlex or Semantic Scholar to analyze large-scale data of abstracts, full texts, or text-based embeddings. Such efforts will not only enrich the metadata associated with each paper, but also enable more precise identification and linkage of bio/chemical entities studied in these papers 117 . Further, although platforms like MAG have implemented advanced algorithms for name disambiguation and topic/field classification at scale, these algorithms are inherently imperfect and not necessarily consistent across datasets; hence it is essential to further validate and improve the accuracy of name disambiguation and topic classifications 118 . Relatedly, in this paper we primarily focus on paper-level linkages across different datasets. Using these linkages as intermediary information, one can further construct and enrich individual-level profiles, allowing us to combine professional information on researchers (e.g., education background, grants, publications, and other broad impact) with important demographic dimensions (e.g., gender, age, race, and ethnicity). Finally, the data lake could contribute to an ecosystem for the collective community of the science of science. For example, there are synergies with the development of related programming packages, such as pySciSci 119 . By making the data lake fully open, we also hope it inspires other researchers to contribute to the data lake and enrich its coverage. For example, when a research team publishes a new measure, they could release a data file that computes their measure based on SciSciNet, effectively adding a new column to the data lake. Lastly, science forms a complex social system and often offers an insightful lens to examine broader social science questions, suggesting that SciSciNet may see greater utility by benefiting adjacent fields such as computational social science 120 , 121 , network science 122 , 123 , complex systems 124 , and more 125 .
Code availability
The source code for data selection and curation, data linkage, and metrics calculation is available at https://github.com/kellogg-cssi/SciSciNet .
References
Liu, L., Jones, B. F., Uzzi, B. & Wang, D. Measurement and Empirical Methods in the Science of Science. Nature Human Behaviour , https://doi.org/10.1038/s41562-023-01562-4 (2023).
Fortunato, S. et al . Science of science. Science 359 , eaao0185 (2018).
Wang, D. & Barabási, A.-L. The science of science . (Cambridge University Press, 2021).
Zeng, A. et al . The science of science: From the perspective of complex systems. Physics reports 714 , 1–73 (2017).
Azoulay, P. et al . Toward a more scientific science. Science 361 , 1194–1197 (2018).
Clauset, A., Larremore, D. B. & Sinatra, R. Data-driven predictions in the science of science. Science 355 , 477–480 (2017).
Liu, L., Dehmamy, N., Chown, J., Giles, C. L. & Wang, D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nature communications 12 , 1–10 (2021).
Jones, B. F. The burden of knowledge and the “death of the renaissance man”: Is innovation getting harder? The Review of Economic Studies 76 , 283–317 (2009).
Way, S. F., Morgan, A. C., Clauset, A. & Larremore, D. B. The misleading narrative of the canonical faculty productivity trajectory. Proceedings of the National Academy of Sciences 114 , E9216–E9223, https://doi.org/10.1073/pnas.1702121114 (2017).
Jones, B. F. & Weinberg, B. A. Age dynamics in scientific creativity. Proceedings of the National Academy of Sciences 108 , 18910–18914 (2011).
Malmgren, R. D., Ottino, J. M. & Amaral, L. A. N. The role of mentorship in protege performance. Nature 465 , 622–U117 (2010).
Liénard, J. F., Achakulvisut, T., Acuna, D. E. & David, S. V. Intellectual synthesis in mentorship determines success in academic careers. Nature communications 9 , 1–13 (2018).
Petersen, A. M. et al . Reputation and Impact in Academic Careers. Proceedings of the National Academy of Science USA 111 , 15316–15321 (2014).
Ma, Y., Mukherjee, S. & Uzzi, B. Mentorship and protégé success in STEM fields. Proceedings of the National Academy of Sciences 117 , 14077–14083 (2020).
Sinatra, R., Wang, D., Deville, P., Song, C. M. & Barabasi, A. L. Quantifying the evolution of individual scientific impact. Science 354 (2016).
Liu, L. et al . Hot streaks in artistic, cultural, and scientific careers. Nature 559 , 396–399 (2018).
Larivière, V., Ni, C., Gingras, Y., Cronin, B. & Sugimoto, C. R. Bibliometrics: Global gender disparities in science. Nature News 504 , 211 (2013).
Sugimoto, C. R. et al . Scientists have most impact when they’re free to move. Nature 550 , 29–31 (2017).
Wuchty, S., Jones, B. F. & Uzzi, B. The increasing dominance of teams in production of knowledge. Science 316 , 1036–1039 (2007).
Wu, L., Wang, D. & Evans, J. A. Large teams develop and small teams disrupt science and technology. Nature 566 , 378–382, https://doi.org/10.1038/s41586-019-0941-9 (2019).
Milojevic, S. Principles of scientific research team formation and evolution. Proceedings of the National Academy of Sciences 111 , 3984–3989 (2014).
Newman, M. E. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences 98 , 404–409 (2001).
AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nature communications 9 , 1–10 (2018).
Shen, H.-W. & Barabási, A.-L. Collective credit allocation in science. Proceedings of the National Academy of Sciences 111 , 12325–12330 (2014).
Leahey, E. From Sole Investigator to Team Scientist: Trends in the Practice and Study of Research Collaboration. Annual Review of Sociology 42 , 81–100 (2016).
Clauset, A., Arbesman, S. & Larremore, D. B. Systematic inequality and hierarchy in faculty hiring networks. Science advances 1 , e1400005 (2015).
Jones, B. F., Wuchty, S. & Uzzi, B. Multi-university research teams: shifting impact, geography, and stratification in science. Science 322 , 1259–1262 (2008).
Deville, P. et al . Career on the move: Geography, stratification, and scientific impact. Scientific reports 4 (2014).
Chu, J. S. & Evans, J. A. Slowed canonical progress in large fields of science. Proceedings of the National Academy of Sciences 118 (2021).
Azoulay, P., Fons-Rosen, C. & Graff Zivin, J. S. Does science advance one funeral at a time? American Economic Review 109 , 2889–2920 (2019).
Jin, C., Ma, Y. & Uzzi, B. Scientific prizes and the extraordinary growth of scientific topics. Nature communications 12 , 1–11 (2021).
Nagaraj, A., Shears, E. & de Vaan, M. Improving data access democratizes and diversifies science. Proceedings of the National Academy of Sciences 117 , 23490–23498 (2020).
Evans, J. A. & Reimer, J. Open access and global participation in science. Science 323 , 1025–1025 (2009).
Peng, H., Ke, Q., Budak, C., Romero, D. M. & Ahn, Y.-Y. Neural embeddings of scholarly periodicals reveal complex disciplinary organizations. Science Advances 7 , eabb9004 (2021).
Ahmadpoor, M. & Jones, B. F. The dual frontier: Patented inventions and prior scientific advance. Science 357 , 583–587 (2017).
Yin, Y., Gao, J., Jones, B. F. & Wang, D. Coevolution of policy and science during the pandemic. Science 371 , 128–130 (2021).
Ding, W. W., Murray, F. & Stuart, T. E. Gender differences in patenting in the academic life sciences. Science 313 , 665–667 (2006).
Bromham, L., Dinnage, R. & Hua, X. Interdisciplinary research has consistently lower funding success. Nature 534 , 684 (2016).
Larivière, V., Vignola-Gagné, E., Villeneuve, C., Gélinas, P. & Gingras, Y. Sex differences in research funding, productivity and impact: an analysis of Québec university professors. Scientometrics 87 , 483–498 (2011).
Li, D., Azoulay, P. & Sampat, B. N. The applied value of public investments in biomedical research. Science 356 , 78–81 (2017).
Fleming, L., Greene, H., Li, G., Marx, M. & Yao, D. Government-funded research increasingly fuels innovation. Science 364 , 1139–1141, https://doi.org/10.1126/science.aaw2373 (2019).
Lazer, D. M. et al . The science of fake news. Science 359 , 1094–1096 (2018).
Scheufele, D. A. & Krause, N. M. Science audiences, misinformation, and fake news. Proceedings of the National Academy of Sciences 116 , 7662–7669 (2019).
Kreps, S. E. & Kriner, D. L. Model uncertainty, political contestation, and public trust in science: Evidence from the COVID-19 pandemic. Science advances 6 , eabd4563 (2020).
Myers, K. R. et al . Unequal effects of the COVID-19 pandemic on scientists. Nature Human Behaviour https://doi.org/10.1038/s41562-020-0921-y (2020).
Wang, D. S., Song, C. M. & Barabasi, A. L. Quantifying Long-Term Scientific Impact. Science 342 , 127–132 (2013).
Uzzi, B., Mukherjee, S., Stringer, M. & Jones, B. Atypical combinations and scientific impact. Science 342 , 468–472 (2013).
Radicchi, F., Fortunato, S. & Castellano, C. Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences 105 , 17268–17272 (2008).
de Solla Price, D. J. Networks of Scientific Papers. Science 149 , 510–515 (1965).
Price, D. d. S. A general theory of bibliometric and other cumulative advantage processes. Journal of the American society for Information science 27 , 292–306 (1976).
Funk, R. J. & Owen-Smith, J. A Dynamic Network Measure of Technological Change. Management Science 63 , 791–817 (2017).
Thelwall, M., Haustein, S., Larivière, V. & Sugimoto, C. R. Do altmetrics work? Twitter and ten other social web services. PloS one 8 (2013).
Wang, R. et al . in Proceedings of the 27th ACM International Conference on Information and Knowledge Management 1487–1490 (Association for Computing Machinery, Torino, Italy, 2018).
Tan, Z. et al . in Proceedings of the 25th International Conference Companion on World Wide Web 437–442 (International World Wide Web Conferences Steering Committee, Montréal, Québec, Canada, 2016).
Yin, Y., Dong, Y., Wang, K., Wang, D. & Jones, B. F. Public use and public funding of science. Nature Human Behaviour https://doi.org/10.1038/s41562-022-01397-5 (2022).
Wu, J. et al . CiteSeerX: AI in a Digital Library Search Engine. AI Magazine 36 , 35–48, https://doi.org/10.1609/aimag.v36i3.2601 (2015).
Wan, H., Zhang, Y., Zhang, J. & Tang, J. AMiner: Search and Mining of Academic Social Networks. Data Intelligence 1 , 58–76, https://doi.org/10.1162/dint_a_00006 (2019).
Zhang, Y., Zhang, F., Yao, P. & Tang, J. in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining . 1002–1011.
Hendricks, G., Tkaczyk, D., Lin, J. & Feeney, P. Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies 1 , 414–427 (2020).
Priem, J., Piwowar, H. & Orr, R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022).
Sinha, A. et al . in Proceedings of the 24th International Conference on World Wide Web 243–246 (Association for Computing Machinery, Florence, Italy, 2015).
Wang, K. et al . A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data 2 , 45 (2019).
Wang, K. et al . Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 , 396–413 (2020).
Pinski, G. & Narin, F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information processing & management 12 , 297–312 (1976).
Carpenter, M. P., Cooper, M. & Narin, F. Linkage between basic research literature and patents. Research Management 23 , 30–35 (1980).
Narin, F., Hamilton, K. S. & Olivastro, D. The increasing linkage between US technology and public science. Research policy 26 , 317–330 (1997).
Marx, M. & Fuegi, A. Reliance on science: Worldwide front‐page patent citations to scientific articles. Strategic Management Journal 41 , 1572–1594 (2020).
Marx, M. & Fuegi, A. Reliance on science by inventors: Hybrid extraction of in‐text patent‐to‐article citations. Journal of Economics & Management Strategy (2020).
de Solla Price, D. Little science, big science . (Columbia University Press, 1963).
Sinatra, R., Deville, P., Szell, M., Wang, D. & Barabási, A.-L. A century of physics. Nature Physics 11 , 791–796 (2015).
de Solla Price, D. Science since babylon . (Yale University Press, 1961).
Lin, Z., Yin, Y., Liu, L. & Wang, D. SciSciNet: A large-scale open data lake for the science of science research, Figshare , https://doi.org/10.6084/m9.figshare.c.6076908.v1 (2022).
Microsoft Academic. Microsoft Academic Graph. Zenodo , https://doi.org/10.5281/zenodo.6511057 (2022).
Smalheiser, N. R. & Torvik, V. I. Author name disambiguation. Annual review of information science and technology 43 , 1–43 (2009).
Tang, J., Fong, A. C., Wang, B. & Zhang, J. A unified probabilistic framework for name disambiguation in digital library. IEEE Transactions on Knowledge and Data Engineering 24 , 975–987 (2011).
Ferreira, A. A., Gonçalves, M. A. & Laender, A. H. A brief survey of automatic methods for author name disambiguation. Acm Sigmod Record 41 , 15–26 (2012).
Sanyal, D. K., Bhowmick, P. K. & Das, P. P. A review of author name disambiguation techniques for the PubMed bibliographic database. Journal of Information Science 47 , 227–254 (2021).
Morrison, G., Riccaboni, M. & Pammolli, F. Disambiguation of patent inventors and assignees using high-resolution geolocation data. Scientific data 4 , 1–21 (2017).
Tekles, A. & Bornmann, L. Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. Quantitative Science Studies 1 , 1510–1528, https://doi.org/10.1162/qss_a_00081 (2020).
Van Buskirk, I., Clauset, A. & Larremore, D. B. An Open-Source Cultural Consensus Approach to Name-Based Gender Classification. arXiv preprint arXiv:2208.01714 (2022).
Cleary, E. G., Beierlein, J. M., Khanuja, N. S., McNamee, L. M. & Ledley, F. D. Contribution of NIH funding to new drug approvals 2010–2016. Proceedings of the National Academy of Sciences 115 , 2329–2334 (2018).
Packalen, M. & Bhattacharya, J. NIH funding and the pursuit of edge science. Proceedings of the National Academy of Sciences 117 , 12011–12016, https://doi.org/10.1073/pnas.1910160117 (2020).
Wang, Y., Jones, B. F. & Wang, D. Early-career setback and future career impact. Nature communications 10 , 1–10 (2019).
Hechtman, L. A. et al . NIH funding longevity by gender. Proceedings of the National Academy of Sciences 115 , 7943–7948 (2018).
Agrawal, A. & Henderson, R. Putting patents in context: Exploring knowledge transfer from MIT. Management science 48 , 44–60 (2002).
Bekkers, R. & Freitas, I. M. B. Analysing knowledge transfer channels between universities and industry: To what degree do sectors also matter? Research policy 37 , 1837–1853 (2008).
Owen-Smith, J. & Powell, W. W. To patent or not: Faculty decisions and institutional success at technology transfer. The Journal of Technology Transfer 26 , 99–114 (2001).
Mowery, D. C. & Shane, S. Introduction to the special issue on university entrepreneurship and technology transfer. Management Science 48 , v–ix (2002).
Williams, R. S., Lotia, S., Holloway, A. K. & Pico, A. R. From Scientific Discovery to Cures: Bright Stars within a Galaxy. Cell 163 , 21–23, https://doi.org/10.1016/j.cell.2015.09.007 (2015).
Hmielowski, J. D., Feldman, L., Myers, T. A., Leiserowitz, A. & Maibach, E. An attack on science? Media use, trust in scientists, and perceptions of global warming. Public Understanding of Science 23 , 866–883 (2014).
Li, J., Yin, Y., Fortunato, S. & Wang, D. A dataset of publication records for Nobel laureates. Scientific data 6 , 33 (2019).
Shen, H., Wang, D., Song, C. & Barabási, A.-L. in Proceedings of the AAAI Conference on Artificial Intelligence .
Ke, Q., Ferrara, E., Radicchi, F. & Flammini, A. Defining and identifying Sleeping Beauties in science. Proceedings of the National Academy of Sciences , 201424329 (2015).
Hirsch, J. E. An index to quantify an individual’s scientific research output. Proceedings of the National academy of Sciences of the United States of America 102 , 16569–16572 (2005).
Waltman, L., Boyack, K. W., Colavizza, G. & van Eck, N. J. A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies 1 , 691–713, https://doi.org/10.1162/qss_a_00035 (2020).
Santamaría, L. & Mihaljević, H. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science 4 , e156 (2018).
Bornmann, L. & Williams, R. An evaluation of percentile measures of citation impact, and a proposal for making them better. Scientometrics 124 , 1457–1478, https://doi.org/10.1007/s11192-020-03512-7 (2020).
Haunschild, R., Daniels, A. D. & Bornmann, L. Scores of a specific field-normalized indicator calculated with different approaches of field-categorization: Are the scores different or similar? Journal of Informetrics 16 , 101241, https://doi.org/10.1016/j.joi.2021.101241 (2022).
Yin, Y. & Wang, D. The time dimension of science: Connecting the past to the future. Journal of Informetrics 11 , 608–621 (2017).
Stringer, M. J., Sales-Pardo, M. & Amaral, L. A. N. Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. Journal of the American Society for Information Science and Technology 61 , 1377–1385 (2010).
Bornmann, L. & Daniel, H.-D. What do we know about the h index? Journal of the American Society for Information Science and Technology 58 , 1381–1385, https://doi.org/10.1002/asi.20609 (2007).
Li, J., Yin, Y., Fortunato, S. & Wang, D. Nobel laureates are almost the same as us. Nature Reviews Physics 1 , 301 (2019).
Abramo, G., D’Angelo, C. & Caprasecca, A. Gender differences in research productivity: A bibliometric analysis of the Italian academic system. Scientometrics 79 , 517–539 (2009).
Huang, J., Gates, A. J., Sinatra, R. & Barabási, A.-L. Historical comparison of gender inequality in scientific careers across countries and disciplines. Proceedings of the National Academy of Sciences 117 , 4609–4616 (2020).
Dworkin, J. D. et al . The extent and drivers of gender imbalance in neuroscience reference lists. Nature neuroscience 23 , 918–926 (2020).
Squazzoni, F. et al . Peer review and gender bias: A study on 145 scholarly journals. Science advances 7 , eabd0299 (2021).
Yang, Y., Tian, T. Y., Woodruff, T. K., Jones, B. F. & Uzzi, B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proceedings of the National Academy of Sciences 119 , e2200841119 (2022).
Squazzoni, F. et al . Only second-class tickets for women in the COVID-19 race. A study on manuscript submissions and reviews in 2329 Elsevier journals. A study on manuscript submissions and reviews in 2329 (2020).
Vincent-Lamarre, P., Sugimoto, C. R. & Larivière, V. The decline of women’s research production during the coronavirus pandemic. Nature index 19 (2020).
Staniscuaski, F. et al . Gender, race and parenthood impact academic productivity during the COVID-19 pandemic: from survey to action. Frontiers in psychology 12 , 663252 (2021).
Fink, J. K. Hereditary spastic paraplegia. Neurologic Clinics 20 , 711–726, https://doi.org/10.1016/S0733-8619(02)00007-5 (2002).
Herzog, C., Hook, D. & Konkiel, S. Dimensions: Bringing down barriers between scientometricians and data. Quantitative Science Studies 1 , 387–395 (2020).
Lawrence, I. & Lin, K. A concordance correlation coefficient to evaluate reproducibility. Biometrics , 255–268 (1989).
Clauset, A., Shalizi, C. R. & Newman, M. E. Power-law distributions in empirical data. SIAM review 51 , 661–703 (2009).
Bornmann, L. & Wohlrabe, K. Normalisation of citation impact in economics. Scientometrics 120 , 841–884, https://doi.org/10.1007/s11192-019-03140-w (2019).
van Eck, N. J. & Waltman, L. Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 111 , 1053–1070, https://doi.org/10.1007/s11192-017-2300-7 (2017).
Xu, J. et al . Building a PubMed knowledge graph. Scientific Data 7 , 205, https://doi.org/10.1038/s41597-020-0543-2 (2020).
Torvik, V. I. & Smalheiser, N. R. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3 , 1–29 (2009).
Reproducible Science of Science at scale: pySciSci Abstract Quantitative Science Studies 1-17, https://doi.org/10.1162/qss_a_00260 .
Lazer, D. M. et al . Computational social science: Obstacles and opportunities. Science 369 , 1060–1062 (2020).
Lazer, D. et al . Computational social science. Science 323 , 721–723 (2009).
Article CAS PubMed PubMed Central Google Scholar
Barabási, A.-L. Network science . (Cambridge University, 2015).
Newman, M. Networks: an introduction . (Oxford University Press, 2010).
Castellano, C., Fortunato, S. & Loreto, V. Statistical physics of social dynamics. Reviews of modern physics 81 , 591 (2009).
Dong, Y., Ma, H., Shen, Z. & Wang, K. in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . 1437–1446 (ACM).
Download references
Acknowledgements
The authors thank Alanna Lazarowich, Krisztina Eleki, Jiazhen Liu, Huawei Shen, Benjamin F. Jones, Brian Uzzi, Alex Gates, Daniel Larremore, YY Ahn, Lutz Bornmann, Ludo Waltman, Vincent Traag, Caroline Wagner, and all members of the Center for Science of Science and Innovation (CSSI) at Northwestern University for their help. This work is supported by the Air Force Office of Scientific Research under award numbers FA9550-17-1-0089 and FA9550-19-1-0354, National Science Foundation grant SBE 1829344, the Alfred P. Sloan Foundation G-2019-12485, and Peter G. Peterson Foundation 21048. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Authors and affiliations.
Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA
Zihang Lin, Yian Yin, Lu Liu & Dashun Wang
Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
Kellogg School of Management, Northwestern University, Evanston, IL, USA
School of Computer Science, Fudan University, Shanghai, China
McCormick School of Engineering, Northwestern University, Evanston, IL, USA
Yian Yin & Dashun Wang
Contributions
D.W. and Y.Y. conceived the project and designed the experiments; Z.L. and Y.Y. collected the data; Z.L. performed data pre-processing, statistical analyses, and validation with help from Y.Y., L.L. and D.W.; Z.L., Y.Y. and D.W. wrote the manuscript; all authors edited the manuscript.
Corresponding author
Correspondence to Dashun Wang.
Ethics declarations
Competing interests.
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article.
Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10 , 315 (2023). https://doi.org/10.1038/s41597-023-02198-9
Received: 13 July 2022
Accepted: 02 May 2023
Published: 01 June 2023
DOI: https://doi.org/10.1038/s41597-023-02198-9
STS 102: Artificial Intelligence in Society
Find Articles via Subject Databases
How to Choose a Database
In deciding which database(s) to use, it is helpful to note:
- Who: Who is authoring these publications? Are these scholarly, popular, or industry sources?
- What: What can I find in the database (e.g. articles, conference proceedings, data)?
- When: When does coverage begin? How well is historical literature covered? Does it include articles published in the last year?
- Where: What is the geographical scope of the coverage? Does that match your research interest?
Recommended Subject Databases:
- Business & Economics
- Computer Science & Engineering
- Human Health & Behavior
- Philosophy & Cultural Studies
- Political Science & Sociology
- Arts & Humanities
- Business Source Complete: Indexing and abstracts for the most important scholarly business journals back as far as 1886 are included. In addition to the searchable cited references provided for more than 1,200 journals, Business Source Complete contains detailed author profiles for the 25,000 most-cited authors in the database. Market research reports, industry reports, country reports, company profiles and SWOT analyses are also included. [Coverage: coverage varies by title with some going as far back as 1886]
- EconLit: A comprehensive, indexed bibliography with selected abstracts of the world's economic literature compiled from the American Economic Association's Journal of economic literature and the Index of economic articles in journals and collective volumes. Topics include economic theory and history, monetary theory and financial institutions; labor economics; international, regional, urban economics; and other related subjects. [Coverage: 1969-present]
- Compendex: Compendex is the broadest and most complete engineering literature database available in the world with over 22 million indexed records from 77 countries across 190 engineering disciplines. [Coverage: 1884-present]
- INSPEC: Inspec was created by the Institution of Engineering and Technology (IET), and is one of the world's most definitive bibliographic scientific engineering research databases, containing over 15 million abstracts and indexing records. Inspec is on the Engineering Village platform and can be searched together with Ei Compendex. By searching both engineering research databases together, engineers gain access to the broadest engineering source available with a single database search experience. [Coverage: 1898-present]
- Web of Science Core Collection: Web of Science Core Collection enables searching of top-cited peer-reviewed content across the sciences, social sciences, and humanities with "cited reference" search capabilities. "It is a curated collection of over 20,000 peer-reviewed, high-quality scholarly journals published worldwide (including Open Access journals) in over 250 science, social sciences, and humanities disciplines. Conference proceedings and book data are also available." There is also access to Journal Citation Reports, which provide impact metrics like the Journal Impact Factor (JIF) and Eigenfactor Scoring. Web of Science also has article, author and institutional citation indices. Includes EndNote Basic online citation management tool. [Coverage: 1900-present]
- BIOSIS Previews: BIOSIS Previews is a database for researching the biological sciences literature, designed by biologists for keeping up with the literature across pure and applied life sciences, including agriculture and medicine. It offers excellent features for searching by taxonomic categories and broad concept codes (subject categories). More than 27 million records in all life science areas, including agriculture, biochemistry, biomedicine, biotechnology, ecology, environmental biology, genetics, microbiology, plant biology, veterinary medicine & pharmacology, and zoology. Indexes over 6,000 journals, serials, books and book chapters, conference proceedings and patents. [Coverage: 1926-present]
- Philosopher's Index: The Philosopher's Index is a comprehensive, bibliographic database covering worldwide research in all areas of philosophy. It is created by philosophers for philosophers. The Philosopher's Index features: the inclusion of documents from philosophy and interdisciplinary publications, extensive indexing, and author-written abstracts. Philosophers prescreen potential source documents for relevance to the field of philosophy; this enables the inclusion of articles from interdisciplinary journals and contributions from interdisciplinary anthologies that pertain to philosophy. The indexing includes the assignment of subject headings that encompass proper names and subject terms using a standardized thesaurus. The indexing improves the quality of search results and the use of the standardized thesaurus provides consistency across records from the various publications. The abstracts help the user more quickly ascertain the relevance of the documents. The database provides global coverage, with source publications from more than 135 countries, and has records from 1940 to present, with additional records dating back to 1902. It includes more than 530,000 records in 37 languages. Sources include: journals (print and e-journal articles from more than 1600 philosophy and interdisciplinary journals); books/monographs, including encyclopedias, dictionaries and book series; anthologies; contributions to anthologies from philosophy and interdisciplinary anthologies; and book reviews. [Coverage: 1940-present]
- Stanford Encyclopedia of Philosophy: Designed as a dynamic reference work, this continuously updated encyclopedia features entries maintained and updated by an expert or group of experts in the field. All entries and updates are refereed by the members of an Editorial Board before they are made public. For citation, fixed editions are made on a quarterly basis and stored archivally. Note: Open (unrestricted public) access to the Encyclopedia has been made possible, in part, with a financial contribution from the University of California Libraries.
- Social Sciences Research Network / Economic Research Network (SSRN/ERN): A worldwide collaborative of scholars devoted to the dissemination of social science research. It is composed of a number of specialized research networks in each of the social sciences.
- Sociological Abstracts: Sociological Abstracts, and its companion file Social Services Abstracts, cover the international literature of sociology, social work, and related disciplines in the social and behavioral sciences. It provides abstracting and indexing of articles and book reviews drawn from thousands of serials publications, plus books, book chapters, dissertations, conference papers, and working papers. [Coverage: 1952-present]
- MLA International Bibliography: The Modern Language Association International Bibliography (MLAIB) covers international scholarly materials on all languages, literatures, linguistics, and folklore from around the world. It includes citations to items from journals, series, books, essay collections, working papers, proceedings, dissertations, and bibliographies. MLAIB does not index book reviews. [Coverage: 1926-present]
How to Access the Full Text of an Article
"Get it at UC" Button
Interlibrary Loan Request
If we do not own a journal or book, you can submit an interlibrary loan (ILL) request to have the book or article (e)mailed to you for free from another UC library. Learn more about how to request books or articles.
- Last Updated: Sep 27, 2024 3:51 PM
- URL: https://guides.library.ucdavis.edu/sts102
https://www.nist.gov/srd/online-scientific-databases
Standard Reference Data
Online scientific databases.
AnthroKids - Anthropometric Data of Children displays the results of two studies on anthropometric data of children.
Atlas of the Spectrum of a Platinum/Neon Hollow-Cathode Lamp in the Region 1130-4330 Å contains wavelengths and intensities for about 5600 lines in the region 1130 Å to 4330 Å. An atlas plot of the spectrum is given, with the spectral lines marked and their intensities, wavelengths, and classifications listed.
Atomic Energy Levels and Wavelengths References contained in this database are from Bibliography on Atomic Energy Levels and Spectra, NBS Special Publication 363 and Supplements, as well as current references since the last published bibliography collected by the NIST Atomic Spectroscopy Data Center. These references pertain to atomic structure and spectra that arise from interactions or excitations involving electrons in the outer shells of free atoms and atomic ions, or from inner shell excitations corresponding to frequencies up to the soft x-ray range.
Periodic Table: Atomic Properties of the Elements A periodic table, containing NIST critically-evaluated data on atomic properties of the elements [SP 966] was designed as a NIST handout for use at exhibitions and trade shows. The publication of the handout coincided with NIST's centennial celebration in 2001. One side of the handout (shown below) is available online in two formats (PDF & TIFF), and is suitable for high-resolution color printing for desk or wall-chart display. [The other side of the handout (not available online) contains historical information.]
Atomic Reference Data for Electronic Structure Calculations contains total energies and orbital eigenvalues for the atoms hydrogen through uranium, as computed in several standard variants of density-functional theory.
The NIST Atomic Spectra Database (ASD) contains data for radiative transitions and energy levels in atoms and atomic ions. Data are included for observed transitions of 99 elements and energy levels of 52 elements. ASD contains data on about 900 spectra from about 1 Å (Ångströms) to 200 µm (micrometers), with about 70,000 energy levels and 91,000 lines, 40,000 of which have transition probabilities listed. The most current NIST-evaluated data associated with each transition are integrated under a single listing.
Atomic Spectral Line Broadening Bibliographic Database contains approximately 800 recent references, all collected after the last published bibliography. The database contains numerical data, general information, comments, and review articles.
Atomic Transition Probability Bibliographic Database provides 6,113 references from 1914 through 1999. These papers contain numerical data, comments and review articles on atomic transition probabilities (oscillator strengths, line strengths, or radiative lifetimes).
Atomic Weights and Isotopic Compositions provides atomic weights for elements 1 through 111, with isotopic compositions or abundances given when appropriate.
Bibliography of Photon Total Cross Section (Attenuation Coefficient) Measurements includes papers reporting absolute measurements of photon (XUV, x-ray, gamma-ray, bremsstrahlung) total interaction cross sections or attenuation coefficients for the elements and some compounds.
NIST Biofuel Database brings together structural, biological, and thermodynamic data for enzymes that are either in current use or are being considered for use in the production of biofuels.
The Biological Macromolecule Crystallization Database and NASA Archive for Protein Crystal Growth Data (BMCD) contains the conditions reported for the crystallization of proteins and nucleic acids used in X-ray structure determinations and archives the results of microgravity macromolecule crystallization studies.
The Ceramics WebBook is a gateway to evaluated data, a guide to data centers and sources, and a repository for tools and resources for ceramics.
The NIST Chemical Kinetics Database includes essentially all reported kinetics results for thermal gas-phase chemical reactions. The database is designed to be searched for kinetics data based on the specific reactants involved, for reactions resulting in specified products, for all the reactions of a particular species, or for various combinations of these. In addition, the bibliography can be searched by author name or combination of names. The database contains in excess of 38,000 separate reaction records for over 11,700 distinct reactant pairs. These data have been abstracted from over 12,000 papers with literature coverage through early 2000.
The NIST Chemistry WebBook, sixth edition, contains thermochemical data for over 6500 organic and small inorganic compounds, reaction thermochemistry data for over 9800 reactions, IR spectra for over 8700 compounds, mass spectra for over 12,600 compounds, UV/Vis spectra for over 400 compounds, electronic/vibrational spectra for over 4100 compounds, constants of diatomic molecules (spectroscopic data) for over 600 compounds, ion energetics data for over 16,000 compounds, and thermophysical property data for 33 fluids. There are many avenues for searching the database.
CIS2 Visual Interoperability Testbed translates a CIS2 file (CIMsteel Integrations Standards) of a steel structure into a 3D interactive model in the form of a VRML (Virtual Reality Modeling Language) file. The translator is part of the research in developing a mapping between VRML prototypes and CIS2.
CKMech Chemical Kinetics Mechanisms was developed to make available data in a report entitled Thermochemical and Chemical Kinetic Data for Fluorinated Hydrocarbons.
The NIST Computational Chemistry Comparison and Benchmark Database is a collection of experimental and ab initio thermochemical properties for a selected set of molecules. It provides users with a benchmark set of molecules for evaluating ab initio computational methods and allows comparison between different ab initio computational methods for the prediction of thermochemical properties.
Dictionary of Algorithms and Data Structures is a dictionary of algorithms, algorithmic techniques, data structures, archetypical problems, and related definitions. Some entries have links to further information and implementations.
The Digital Library of Mathematical Functions (DLMF) is a compendium of essential properties of the special functions of applied mathematics, which are ubiquitous in mathematical modeling and scientific computing. The DLMF includes references to proof sources for every formula, descriptions of relevant mathematical techniques, illustrative graphics, and links to online research literature, algorithms and software. The DLMF is updated periodically to cite or include new published results.
Electron Interactions with Plasma Processing Gases has data excerpted from recently published articles in the Journal of Physical and Chemical Reference Data (JPCRD). Presented here are collision cross section data and electron transport coefficients for gases, such as CF4 and CHF3, used in the manufacturing of semiconductor devices.
The Elemental Data Index provides access to the holdings of the NIST Physics Laboratory online data organized by element. It is intended to simplify the process of retrieving online scientific data for a specific element.
Engineering Statistics Handbook details numerous methods to help scientists and engineers incorporate statistical methods in their work.
Fire Research Information Services (FRIS) is a resource for fire protection engineers, scientists, and fire service personnel. It includes FIREDOC, which is a complete fire research bibliographic database with 55,000 holdings (published reports, journal articles, conference proceedings, books, and audiovisual items) of FRIS. Each reference has complete bibliographic information.
FLYCHK Collisional-Radiative Code provides a capability to generate atomic level populations and charge state distributions for low-Z to mid-Z elements under NLTE conditions.
Frequencies for Interstellar Molecular Microwave Transitions presents critically evaluated transition frequencies for the molecular transitions detected in interstellar and circumstellar clouds.
CODATA Fundamental Physical Constants, developed in the Physics Laboratory at NIST, addresses three topics: fundamental physical constants, the International System of Units (SI), which is the modern metric system, and expressing the uncertainty of measurement results.
Fundamental Physical Constants - International System of Units (SI) lists important definitions related to the modern metric system of measurement.
Fundamental Physical Constants - Searchable Bibliography on the Constants contains citations for the most important theoretical and experimental publications relevant to the fundamental constants and closely related precision measurements published since the mid 1980s.
Ground Levels and Ionization Energies for the Neutral Atoms provides data for ground state electron configurations and ionization energies for the neutral atoms (Z = 1-104) including references.
The Guide to Available Mathematical Software studies techniques to provide scientists and engineers with improved access to reusable computer software components which are available to them for use in mathematical modeling and statistical analysis. It provides centralized access to such items as abstracts, documentation, and source code of software modules that it catalogs.
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results presents a method of evaluating and expressing uncertainty in measurement adapted from NIST Technical Note 1297.
NIST Heat Transmission Properties of Insulating and Building Materials provides a valuable reference for building designers, material manufacturers, and researchers in the thermal design of building components and equipment. NIST has accumulated a valuable and comprehensive collection of thermal conductivity data from measurements performed with a 200-mm square guarded-hot-plate apparatus (from 1933 to 1983).
NIST High Temperature Superconducting Materials Database provides evaluated thermal, mechanical, and superconducting property data for oxides and other nonconventional superconductors.
NIST Interactive Algorithm for Isotopic CO2 Measurements is a web-based tool for converting carbon dioxide isotope measurements into standardized delta 13-C and delta-18-O values.
The International Comparisons Database (ICDB) provides information on Appendices B and D of the Comité International des Poids et Mesures (CIPM) Mutual Recognition Arrangement (MRA). The official source of the data is the BIPM key comparison database. The ICDB provides access to results of comparisons of measurements and standards organized by the consultative committees of the CIPM and the Regional Metrology Organizations.
The NIST ITS-90 Thermocouple Database contains the most commonly used tables of NIST Monograph 175, "Temperature-Electromotive Force Reference Functions and Tables for the Letter-Designated Thermocouple Types Based on the ITS-90," by Burns, Kaeser (formerly Scroger), Strouse, Croarkin, and Guthrie. These reference functions have been adopted as standards by the American Society for Testing and Materials and the International Electrotechnical Commission.
The Matrix Market Database is a visual repository of test data for use in comparative studies of algorithms for numerical linear algebra, featuring nearly 500 sparse matrices from a variety of applications, as well as matrix generation tools and services.
NLTE4 Plasma Population Kinetics Database contains benchmark results for simulation of plasma population kinetics and emission spectra. The data were contributed by the participants of the 4th Non-LTE Code Comparison Workshop, who have unrestricted access to the database. The only limitation for other users is in hidden labeling of the output results. Guest users can proceed to the database entry page without entering a user ID and password.
Phase Diagrams and Computational Thermodynamics - Solder Systems predicts melting temperatures and freezing ranges of lead-free solders which are used to remove lead-containing components from commercial products. This database also shows the effects of non-equilibrium solidification. Also included is a collection of calculated binary and ternary systems that are relevant to solders.
NIST Property Data Summaries for Advanced Materials are topical collections of property values derived from surveys of published data. Thermal, mechanical, structural, and chemical properties are included in the collections.
The Protein Data Bank (PDB) is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data. It is currently managed by a collaboration of Rutgers University, the University of California, San Diego, and NIST.
Radionuclide Half-Life Measurements presents the half-lives of many radionuclides measured at NIST. Revised values for the half lives of various "short-lived" radionuclides arise from improved impurity analysis, incorporation of additional data from new sources, and reevaluation of old data.
NIST Recommended Rest Frequencies for Observed Interstellar Molecular Microwave Transitions -1991 Revision provides critically evaluated transition frequencies for the molecular transitions detected in interstellar and circumstellar clouds. The tabulated transitions are recommended for reference in future astronomical observations in the microwave and millimeter wavelength regions.
SAHA Plasma Population Kinetics Database contains benchmark results for simulation of plasma population kinetics and emission spectra. The data were contributed by the participants of the 3rd Non-LTE Code Comparison Workshop, who have unrestricted access to the database. The only limitation for other users is in hidden labeling of the output results. Guest users can proceed to the database entry page without entering a user ID and password.
Short Tandem Repeat DNA Internet Database is intended to benefit research and application of short tandem repeat DNA markers for human identity testing. Facts and sequence information on each STR system, population data, commonly used multiplex STR systems, PCR primers and conditions and a review of various technologies for analysis of STR alleles have been included.
The Statistical Reference Datasets is also supported by the Standard Reference Data Program. The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software.
Stopping-Power and Range Tables for Electrons, Protons & Helium Ions: The databases ESTAR, PSTAR, and ASTAR calculate stopping-power and range tables for electrons, protons, or helium ions, according to methods described in ICRU Reports 37 and 49. Stopping-power and range tables can be calculated for electrons in any user-specified material and for protons and helium ions in 74 materials.
NIST Structural Ceramics Database (WebSCD) provides evaluated materials property data for a wide range of advanced ceramics known variously as structural ceramics, engineering ceramics, and fine ceramics.
Tables of X-Ray Mass Attenuation Coefficients and Mass Energy-Absorption Coefficients from 1 keV to 20 MeV for Elements Z = 1 to 92 and 48 Additional Substances of Dosimetric Interest includes tables covering energies of the photon (x-ray, gamma ray, bremsstrahlung) from 1 keV to 20 MeV.
Thermodynamics of Enzyme-Catalyzed Reactions Database contains thermodynamic data on enzyme-catalyzed reactions that have been recently published in the Journal of Physical and Chemical Reference Data (JPCRD). For each reaction the following information is provided: the reference for the data, the reaction studied, the name of the enzyme used and its Enzyme Commission number, the method of measurement, the data and an evaluation thereof.
ThermoPlan - NIST Standard Reference Database #167 Experimental Planning and Coverage Evaluation Aid for Thermophysical Property Measurements
This web application provides free and open access for the broader research community to the experimental planning utilities that are incorporated into ThermoData Engine (TDE) [J. Chem. Inf. Model. 2005, 45, 816-838]. TDE provides recommendations for the relative merit of a proposed measurement via assessment of the existing body of knowledge, including availability of experimental thermophysical property data, variable ranges studied, associated uncertainties, state of prediction methods, and parameters for deployment of prediction methods. The web application provides utilities for the assessment of specific property measurements for pure and binary chemical systems, the broader data needs of pure systems, and recommendations for binary mixture measurements that could extend the current UNIFAC model.
The Database of the Thermophysical Properties of Gases Used in the Semiconductor Industry concerns transport and thermodynamic property data for the gases used in semiconductor processing. The data are useful for equipment modeling in chemical vapor deposition (CVD) processes, and they also provide a rational basis for the calibration of mass flow controllers (MFCs) used to meter process gases.
The NIST Visible Cement Dataset consists of data on cement pastes with ratios between 0.3 and 0.45 that were prepared and viewed after various hydration times.
Wavenumber Calibration Tables from Heterodyne Frequency Measurements contains the bibliography and atlas as updated through November 1994. The atlas and wavenumber tables consist of many pages of spectral maps accompanied by tables of transition wavenumbers and their identity.
XCOM: Photon Cross Sections Database can be used to calculate photon cross sections for scattering, photoelectric absorption and pair production, as well as total attenuation coefficients, for any element, compound or mixture (Z <= 100) at energies from 1 keV to 100 GeV.
X-Ray Attenuation and Absorption for Materials of Dosimetric Interest: Tables and graphs of the photon mass attenuation coefficient and the mass energy-absorption coefficient are presented for all of the elements Z = 1 to 92, and for 48 compounds and mixtures of radiological interest. The tables cover energies of the photon (x-ray, gamma ray, bremsstrahlung) from 1 keV to 20 MeV.
X-Ray Form Factor, Attenuation and Scattering Tables: This database collects tables and graphs of the form factors, the photoabsorption cross section, and the total attenuation coefficient for any element (Z <= 92).
The NIST X-ray Photoelectron Spectroscopy (XPS) Database gives easy access to the energies of many photoelectron and Auger-electron spectral lines. Resulting from a critical evaluation of the published literature, the database contains over 19,000 line positions, chemical shifts, doublet splittings, and energy separations of photoelectron and Auger-electron lines. A highly interactive program allows the user to search by element, line type, line energy, and many other variables.
Students Accelerate Data-Driven Climate Research through Climate+
More than 30 students participated on eight project teams in summer 2024.
This summer, students in Duke University’s Climate+ program used data science techniques to research climate challenges and potential solutions. They studied topics like saltwater intrusion, energy materials, rainfall predictions and links between climate and health.
Summer 2025 Proposals Due November 4
Duke faculty are invited to submit proposals for Climate+ projects to take place in summer 2025.
Climate+ offers students opportunities to take part in small research teams as a part of Duke’s 10-week Data+ summer experience. Teams of two to four undergraduate students work with a graduate student project manager and faculty leads to collect, analyze and/or visualize data to contribute to climate research. Students, who are pursuing degrees across a range of disciplines, learn to apply data science techniques like machine learning and geospatial data analysis as they undertake projects.
For summer 2024, Climate+ teams included:
- Environmental and Climate Exposures and Social Determinants of Health
- Data- and Machine Learning–Driven Analysis of Atomic Dynamics in Energy Materials
- Detecting Saltwater Intrusion in Rivers Using Remote Sensing
- Monitoring Spartina alterniflora Using Self-supervised Learning
- Duke Forest Reptile and Amphibian Data
- Energy Transition During Energy Crisis: Cape Town's Experience
- Improving Future Rainfall Predictions in the Southeastern US
- Making Climate Hazard Risk Data Useful for North Carolina Communities
Findings from this summer’s teams are already informing climate solutions. One group organized and analyzed reptile and amphibian observations in Duke Forest, providing insights that are helping forest managers monitor and protect species.
Another Climate+ team worked closely with the town of Creswell, NC, and the North Carolina Office of Recovery and Resilience (NCORR) to measure flood risks, developing different damage scenarios to help the town and its residents prepare for flooding impacts.
“When the students went to do on-the-ground data collection in Creswell, they got to know town leaders and some of the people who are facing flood risks,” said Robert Calderbank, director of the Rhodes Information Initiative at Duke. “People in the community were eager to partner with students around this challenge, which becomes more urgent with every heavy rain event.”
Enthusiastic about the progress of the partnership, Creswell community leaders will soon be meeting with Duke and NCORR to discuss next steps. A Bass Connections team will build on the partnership’s efforts during the 2024-2025 school year.
Since the summer 2022 launch of Climate+, more than 90 students have contributed to 21 interdisciplinary project teams spanning ecology, biology, engineering, environmental science and more.
Like all students in the broader Data+ program, Climate+ students have opportunities to learn from visiting data science professionals across numerous industries and from other student teams’ experiences and insights.
In addition, Climate+ students participate in a series of unique workshops to enhance their climate literacy, data science and interdisciplinary communication skills. Guest speakers at this year’s workshops covered topics like machine learning, data visualization, climate change science, sustainable agriculture and climate hazard risks and decision-making.
"Climate+ provides students interested in data science with opportunities to learn how these tools can help us address the causes and consequences of climate change. Over the summer, students can make meaningful progress toward climate solutions,” said Kyle Bradbury, director of the Energy Data Analytics Lab at the Nicholas Institute for Energy, Environment & Sustainability.
Climate+ is offered by the Nicholas Institute in partnership with the Rhodes Information Initiative at Duke. The program is aligned with the Duke Climate Commitment, a university-wide initiative that unites Duke’s education, research, operations and public service missions to address climate challenges. Funding for Climate+ comes from The Duke Endowment and the Rhodes Information Initiative.
Unconventional Data Sources Fuel Research Innovations
- Market-research panels offer researchers access to millions of participants, specializing in the ability to engage hard-to-reach groups. But their use is currently limited to less than 15% of psychological studies.
- Population-level administrative data offer an affordable and detailed source of information for longitudinal studies.
- Data from global positioning systems (GPS) can be integrated with other types of data, such as heart rate or life satisfaction, and can be analyzed with familiar statistical methods like correlations and regressions.
- Special ethical and data quality considerations may be needed when researchers use unconventional data sources.
- By engaging in interdisciplinary collaborations, researchers are more likely to be exposed to new approaches to research, including the use of unconventional data sources.
As a postdoctoral researcher studying experimental psychology at New York University (NYU) in the early 2000s, Leib Litman had no problem finding participants for his large studies on episodic memory.
“NYU is this huge place,” Litman said in an interview with the Observer. “There is an endless participant pool of undergraduates that you have access to pretty much at any time of the year, except maybe in the summer where it gets a little bit more difficult to recruit participants.”
But when he later moved on to a faculty position at a small, private college, he realized access to large numbers of students was a luxury he would no longer be afforded. This need for participants to fuel his research led Litman and a colleague in computer science, Jonathan Robinson, to create a suite of online tools that would expedite the process of identifying participants.
“I’ll never forget the first time we did a research study online and collected 500 people in a matter of an hour,” Litman said. “It was really one of those life-changing moments when I realized, you know, this is a complete revolution in science.”
Litman is now a cofounder and the chief research officer of CloudResearch, an online research and participant recruitment platform. He is also a professor of psychology at Touro University’s Lander College.
What started as a solution to a personal research problem now serves tens of thousands of researchers at over 5,000 institutions. Litman remembers one of the first times he unveiled the project to the research community at an APS Annual Convention about a decade ago.
“It was like standing room only,” he said. “People were extremely, extremely interested because it was very clear that the problems that I was having, everybody else was having, too.”
Since then, online studies have become the norm for psychological research, and CloudResearch’s Connect is one of the major platforms that researchers turn to for participant recruitment. But CloudResearch also offers another option to find participants that Litman believes has been largely underutilized for behavioral research: market-research panels.
With their Prime Panels platform, CloudResearch aggregates over 100 million participants from 300 market-research panels—a participant pool that massively eclipses the approximately 100,000 available on Connect.
Yet Litman estimates that only about 10%–15% of psychological studies turn to market-research panels for participant recruitment, though they are more common in other disciplines like political science.
In a recent article for Advances in Methods and Practices in Psychological Science, Litman and his colleagues provided a tutorial on best practices for using market-research panels in behavioral science to help researchers decide if panels are the right approach for their studies (Moss et al., 2023).
Panels are run by market-research platforms with the goal of recruiting participants to understand consumer behavior and perceptions around a particular product. They vary in their approach, but they usually include a rewards program that incentivizes participation. Panels specialize in targeting different populations, organized by factors such as demographic segments, geographic regions, or language-specific recruitment. They also allow researchers to sample participants from most countries around the world.
“The main benefit of aggregating across multiple platforms is the ability to reach people at the kinds of scales that can’t be matched at all with any single platform,” Litman said. “Like when you’re looking for difficult-to-reach clinical participants or participants within specific cities or even ZIP code areas. For consumer research, you can find people who are using products in a very specific way.”
The challenge with using market-research panels is the lack of control over the platform, which can lead to data quality issues. Researchers do not control how much participants are paid and need to screen carefully to weed out fraudulent participation.
“There are a lot of papers that are written that just contain misinformation because they didn’t do enough to clean the data,” Litman said.
CloudResearch has combatted the issue of bad data quality by creating Sentry, a tool that automatically filters out low-quality and fraudulent responses by examining the technical and behavioral characteristics of each participant before they enter a survey. The tool takes about 20 seconds per participant and filters out about 30% of panel traffic. Even so, researcher vigilance is a must.
“The vast majority of fraud is removed through that mechanism,” Litman said. “But there’s only so much we could do, and so it is a partnership between CloudResearch and the researchers.”
Litman has seen the landscape of psychological research change drastically over the past decade, with online research revolutionizing what’s possible for social sciences, but he asserts that the ease of accessing participants brings new challenges that researchers must learn to problem-solve.
“It has to be done right, otherwise you run the risk of misinforming science and misinforming the public,” he said.
Administrative data support research on rare and long-term outcomes
Another methodological approach that psychologists choose less often is the use of data from administrative systems. These data are created as individuals interact with government and private administrative systems in areas such as health care, social welfare, criminal justice, and education.
In the United States, multiple large-scale administrative systems are designed for research, including birth and death records from the National Vital Statistics System, school test scores from the National Center for Education Statistics, and use of health care services from the Veterans Health Administration. Some of these data are publicly available, while sensitive information has restricted access and specific protocols for researchers to follow.
Leah Richmond-Rakerd, an assistant professor of psychology at the University of Michigan and a 2024 APS Janet Taylor Spence Award recipient, first became interested in the power of administrative data while working with epidemiologic survey data during her graduate research.
“That really helped to introduce me to the benefits of things like representative sampling and being able to work with large data sources to study associations across population subgroups or over time,” Richmond-Rakerd said in an interview with the Observer.
Richmond-Rakerd and her colleagues recently had a paper published in Current Directions in Psychological Science that describes a few distinct, and largely untapped, benefits of using population-level administrative data for psychological research.
First, data collection is expensive, especially when done over large scales or over an extended period. And for longitudinal studies, it can be difficult to ensure the sample stays consistent.
“If we’re conducting research on people over time, they may drop out of studies over time, and we may lose access to them and their information,” she said.
Conversely, administrative data can often be accessed at no cost to the researcher. And because administrative data have detailed information about the timing of specific events—the time a new medication is prescribed, for example—they can pinpoint what factors led to a specific outcome.
These data also offer the opportunity to study conditions that are rare in the population, such as schizophrenia or suicide mortality.
“Often times, when researchers are interested in those kinds of things, they have to turn to more selected samples to obtain sufficient numbers of people,” Richmond-Rakerd said. “But in population-level administrative data, researchers can study those kinds of lower prevalence conditions while still working within a representative data source.”
Another unique opportunity for researchers using administrative data is to link that information to other datasets, such as those that contain residential information or large-scale environmental characteristics. For example, Richmond-Rakerd worked with colleagues at the University of Virginia, Duke University, the University of Auckland, and the University of Otago to study the link between risk for dementia and the characteristics of the neighborhoods in which individuals lived.
“We don’t yet, in the United States, have the ability to link information about people’s interactions with different types of systems at the individual level nationwide,” Richmond-Rakerd said. “Those kinds of population-level administrative data sources do, however, exist in other countries, such as the ones that my team has worked with in New Zealand and Denmark, and in other countries such as Sweden.”
Large-scale datasets come with challenges. For example, some information, including about social identities, may not be systematically or precisely measured.
“Administrative data traditionally are not collected specifically for research,” she clarified. “These data are recorded as part of the carrying out or delivery of various public services.”
GPS data can provide new insights on movement behavior
Location-based data have also been included in the recent wave of new data sources used for psychological research. Researchers have begun to experiment with ways to incorporate GPS data into research on behavior, tracking patterns of movement and locations visited.
Interdisciplinary Work Leads to Innovative Thinking
Sharon Koppman, a sociologist and associate professor at the University of California, Irvine’s Paul Merage School of Business, has seen the influence of interdisciplinary environments on innovation: In her research, she has found that the presence of inroads into other disciplines often allows for novel approaches to slip in.
Koppman and her colleague Erin Leahey, a professor of sociology at the University of Arizona, looked to their own field of sociology to investigate the factors that lead scientists to adopt unconventional methods—such as accessing data from atypical sources—in their research. In a study focused on individuals with sociology PhDs, the researchers found that participants with higher status in their careers were more likely to try unconventional methods than those with lower status. In this case, these higher-status participants were primarily men who were affiliated with top-tier universities.
“They’re more likely to innovate and also fail,” Koppman said in an interview with the Observer. “But they’ve already kind of made it, and so their failures are not really going to affect them very much.”
Koppman said researchers from some fields are more likely to try new approaches than others, which can often be influenced by how a field defines itself. If a field is beholden to a particular method, such as the ethnographic approach of anthropology, trying a new approach can feel like changing the definition of what it means to work in that field. By creating departments that include perspectives from multiple disciplines, Koppman believes institutions can help facilitate a more consistent exchange among scientists as they become familiar with new methods, data sources, and approaches to research.
GPS data can be integrated with other types of data, such as heart rate or life satisfaction, and can be analyzed with familiar statistical methods like correlations and regressions. But researchers require a specific skillset to use these data effectively.
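As a minimal sketch of that kind of integration, the Python below joins a GPS-derived daily summary to a daily self-report by participant and day, then computes a Pearson correlation and a simple least-squares line. The names `daily_km` and `daily_rating` and every value in them are invented purely for illustration; they are not drawn from any study mentioned in this article.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def join_on_keys(left, right):
    """Keep only the (participant, day) keys present in both sources."""
    keys = sorted(left.keys() & right.keys())
    return [left[k] for k in keys], [right[k] for k in keys]


# Hypothetical toy data: kilometres travelled per day (from GPS) and a
# daily life-satisfaction rating (from a survey), keyed by (participant, day).
daily_km = {
    ("p1", "2024-01-01"): 2.0,
    ("p1", "2024-01-02"): 5.0,
    ("p2", "2024-01-01"): 1.0,
    ("p2", "2024-01-02"): 4.0,
    ("p2", "2024-01-03"): 3.0,  # no matching survey entry, so it is dropped
}
daily_rating = {
    ("p1", "2024-01-01"): 3,
    ("p1", "2024-01-02"): 5,
    ("p2", "2024-01-01"): 2,
    ("p2", "2024-01-02"): 6,
}

xs, ys = join_on_keys(daily_km, daily_rating)
r = pearson_r(xs, ys)
slope, intercept = ols_line(xs, ys)
print(f"matched days={len(xs)}  r={r:.2f}  slope={slope:.2f}  intercept={intercept:.2f}")
```

Pooling days across participants like this ignores the fact that observations are nested within people; a real analysis would likely need a model that accounts for that nesting.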
In a 2022 tutorial paper, Sandrine Müller and colleagues describe how to manage challenges associated with these data, such as privacy considerations and how to interpret the psychological implications of movement patterns (Müller et al., 2022).
Like market-research panels, GPS data require a specific data quality process before they can be analyzed. Researchers must identify and remove inaccurate GPS records, which are not uncommon because of frequent technical issues such as lapses in satellite connectivity.
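One way such a cleaning step might look is sketched below in Python. It is not the procedure from the tutorial: the `Fix` record, its field names, and both thresholds (`max_accuracy_m`, `max_speed_mps`) are assumptions chosen for illustration. The sketch drops fixes whose device-reported accuracy is poor and fixes that would imply an implausible travel speed from the last accepted fix.

```python
import math
from dataclasses import dataclass


@dataclass
class Fix:
    """One GPS record (hypothetical schema)."""
    t: float           # seconds since the start of recording
    lat: float         # latitude in degrees
    lon: float         # longitude in degrees
    accuracy_m: float  # device-reported horizontal accuracy in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def clean_fixes(fixes, max_accuracy_m=100.0, max_speed_mps=70.0):
    """Return fixes in time order, skipping inaccurate or implausible ones.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for fix in sorted(fixes, key=lambda f: f.t):
        if fix.accuracy_m > max_accuracy_m:
            continue  # device itself reports a poor position estimate
        if kept:
            prev = kept[-1]
            dt = fix.t - prev.t
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            speed = haversine_m(prev.lat, prev.lon, fix.lat, fix.lon) / dt
            if speed > max_speed_mps:
                continue  # jump too large to be real movement
        kept.append(fix)
    return kept


raw = [
    Fix(0, 40.0000, -74.0, 10),
    Fix(60, 40.0005, -74.0, 10),
    Fix(120, 41.0000, -74.0, 10),   # roughly 111 km in one minute: dropped
    Fix(180, 40.0010, -74.0, 500),  # poor reported accuracy: dropped
    Fix(240, 40.0010, -74.0, 10),
]
cleaned = clean_fixes(raw)
print([f.t for f in cleaned])
```

Comparing each fix against the last accepted one, rather than the last raw one, keeps a single bad jump from also invalidating the good record that follows it.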
To ensure ethical use of GPS data, researchers must give special consideration to how the data are secured and disconnected from any participant identifiers. This includes removing the coordinates of the home and work locations of participants and assigning labels to obscure exact locations.
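A rough Python sketch of that masking step follows. The `anchors` mapping, the 200-metre radius, and the decision to round all remaining coordinates to two decimal places are assumptions made for illustration, not a procedure taken from the tutorial; a real study would set these with its ethics board.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    earth_radius_m = 6_371_000.0
    p1, p2 = math.radians(a[0]), math.radians(b[0])
    dphi = p2 - p1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(h))


def mask_locations(points, anchors, radius_m=200.0, decimals=2):
    """Replace points near a sensitive anchor with its label only.

    points:  list of (lat, lon) tuples
    anchors: dict mapping a label such as "home" to its (lat, lon)
    Points outside every radius keep coarsened coordinates (an assumption).
    """
    masked = []
    for p in points:
        label = next(
            (name for name, centre in anchors.items() if haversine_m(p, centre) <= radius_m),
            None,
        )
        if label is not None:
            masked.append({"place": label})  # coordinates removed entirely
        else:
            masked.append(
                {"place": "other", "lat": round(p[0], decimals), "lon": round(p[1], decimals)}
            )
    return masked


# Hypothetical participant: anchor coordinates are invented.
anchors = {"home": (40.0, -74.0), "work": (40.05, -74.0)}
points = [(40.0005, -74.0), (40.0501, -74.0002), (40.1234, -74.5678)]
masked = mask_locations(points, anchors)
print(masked)
```

The anchor coordinates themselves would need to be stored separately from the analysis dataset, or discarded once labeling is done, for the masking to protect anyone.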
Richmond-Rakerd also emphasized the unique ethical considerations of relying on administrative data. She stressed the importance of using responsible research practices when using these data, such as developing research questions and hypotheses before engaging with datasets.
“It’s important to keep in mind with administrative data that you’re often working with very, very large-scale data resources, and so most associations will be statistically significant,” she said, adding that it can be helpful to focus more on effect size than significance.
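The point about sample size can be illustrated with a small Python sketch. It uses the Fisher z transformation, a standard large-sample approximation chosen here for convenience, to get a two-sided p-value for a Pearson correlation; the correlation of 0.01 and the two sample sizes are made-up values, not results from any dataset discussed above.

```python
import math


def two_sided_p(r, n):
    """Approximate two-sided p-value for a Pearson r using the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc keeps precision for very small tail probabilities.
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.01  # hypothetical, very weak correlation
for n in (1_000, 1_000_000):
    p = two_sided_p(r, n)
    print(f"n={n:>9,}  r={r}  variance explained={r * r:.4%}  p={p:.3g}")
```

The same tiny correlation, which explains about a hundredth of a percent of the variance, is nowhere near significant with a thousand observations and overwhelmingly significant with a million, which is why the effect size is the more informative number at administrative-data scale.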
As researchers continue to learn how to most effectively use unconventional data sources, they share lessons learned with those in their own fields, and also with collaborating researchers from other fields. Richmond-Rakerd anticipates that use of administrative datasets will become more common as psychologists collaborate with researchers in fields like economics and health, where they are more commonly used, as well as those outside of the United States.
“More interdisciplinary collaboration isn’t just beneficial for bringing in new theoretical or methodological perspectives, but also opens up opportunities for psychologists to gain more experience and training in working with these kinds of data resources,” Richmond-Rakerd said.
Koppman, S., & Leahey, E. (2019). Who moves to the methodological edge? Factors that encourage scientists to use unconventional methods. Research Policy, 48(9), Article 103807. https://doi.org/10.1016/j.respol.2019.103807
Moss, A. J., Hauser, D. J., Rosenzweig, C., Jaffe, S., Robinson, J., & Litman, L. (2023). Using market-research panels for behavioral science: An overview and tutorial. Advances in Methods and Practices in Psychological Science, 6(2). https://doi.org/10.1177/25152459221140388
Müller, S. R., Bayer, J. B., Ross, M. Q., Mount, J., Stachl, C., Harari, G. M., Yung-Ju, C., & Huyen, H. T. (2022). Analyzing GPS data for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 5(2). https://doi.org/10.1177/25152459221082680
Richmond-Rakerd, L. S., Dent, K. R., Andersen, S. H., D’Souza, S., & Milne, B. J. (2024). Population-level administrative data: A resource to advance psychological science. Current Directions in Psychological Science. https://doi.org/10.1177/09637214241275570
Liquid Crystals Mimic Life in Stunning New Research
New research reveals that liquid crystals can form dynamic structures that mimic biological transport systems, suggesting potential applications in creating self-assembling materials and modeling biological systems.
Liquid crystals are everywhere. They are used in numerous applications, such as cell phone screens, video game consoles, car dashboards, and medical devices. Due to the unique properties of these fluids, if you run an electric current through liquid crystal displays (LCDs), they generate colors: rearrange their shape, and they will reflect different wavelengths of light.
New Discoveries in Liquid Crystal Structures
Now, researchers at the lab of Chinedum Osuji, Eduardo D. Glandt Presidential Professor and Chair of Chemical and Biomolecular Engineering, have discovered these remarkable crystals may be able to do even more. Under the right conditions, liquid crystals condense into astonishing structures, spontaneously generating filaments and flattened discs that can transport material from one place to another, much like complex biological systems. This insight may lead to self-assembling materials, new ways to model cellular activity, and more.
“It’s like a network of conveyor belts,” says Christopher Browne, a postdoctoral researcher in Osuji’s lab and the co-first author of a recent paper in Proceedings of the National Academy of Sciences (PNAS) that describes the finding. “It was this serendipitous observation of something that superficially looks very lifelike — that was the initial cue that this might be something more general and more interesting.”
Collaborative Research on Condensate Formation
Browne and Osuji are now part of an NSF-supported interdisciplinary group at the Laboratory for Research on the Structure of Matter (LRSM) led by Matthew Good, Associate Professor of Cell and Developmental Biology within the Perelman School of Medicine, and Elizabeth Rhoades, Professor of Chemistry within the School of Arts & Sciences, that is studying condensate formation in biological and non-biological systems.
Unusual Behavior in Liquid Crystal Phase Separation
Originally, Osuji’s lab partnered with ExxonMobil to investigate mesophase pitch, a substance used in the development of high-strength carbon fibers, like those found in Formula 1 cars and high-end tennis rackets. “Those materials are liquid crystals,” says Osuji, of the chemical precursors to the carbon fibers themselves. “Or better stated, they are liquid crystalline over some period of their existence during processing.” While experimenting with condensates at different temperatures, Yuma Morimitsu, another postdoctoral fellow in the Osuji Lab and the paper’s other co-first author, noticed unusual behavior in the material.
Normally, if you combine two immiscible (that is, not mixable) fluids, heat them to a temperature high enough to force them to mix, and then cool the mixture, it will at some point separate, or “demix.” Typically, this happens through the formation of droplets that coalesce into a separate layer, much like how combining oil and water eventually leaves a layer of oil on top of the water.
Unique Phase Separation and Structural Formation
In this case, the liquid crystal, 4’-cyano 4-dodecyloxybiphenyl, also known as 12OCB, spontaneously formed highly irregular structures when separating from squalane, a colorless oil. “Instead of forming drops,” says Osuji, “when you have this phase separation between the liquid crystal and the other components of the system, you form cascaded structures, the first of which is these filaments, which grow rapidly and thereafter form another set of structures — what we call bulged discs or flat droplets.”
Observations and Implications of Liquid Crystal Behavior
To understand the system, the researchers used powerful microscopes to observe the liquid crystals’ movement on the micrometer scale (that is, millionths of a meter, comparable to the width of a human hair). “The first time we saw these structures, we looked at them at a cooling rate that was excessively high,” recalls Osuji; the rapid cooling caused the liquid crystals to clump together. Only by lowering the cooling rate and zooming in further did the researchers realize that the liquid crystals were spontaneously forming structures reminiscent of biological systems.
Interestingly, Browne found that several researchers had come close to observing similar behavior decades ago, but either studied systems in which the behavior was not particularly pronounced or lacked microscopy powerful enough to visualize what was happening.
Potential Applications and Future Research
For Browne, the result’s most exciting implication is that it brings together several traditionally disparate fields: the world of active matter research, which focuses on biological systems that transport material and produce motion, and the realms of self-assembly and phase behavior, which study materials that create new structures on their own and that behave differently when changing phase. “This is a new type of active matter system,” says Browne.
He and Osuji also point to the possibility of leveraging the findings to emulate biological systems, either to better understand how they work or to manufacture materials. “Molecules are being absorbed into the filaments and then shuttled into those flat droplets continuously,” says Osuji, “even though just by looking at the system, you can’t discern any obvious activity.” In effect, the flat droplets could function like small reactors, churning out molecules that the filaments carry to other droplets for storage or further chemical activity.
The researchers also suggest that their findings could reinvigorate research into liquid crystals themselves. “When a field becomes industrialized,” says Browne, “oftentimes the fundamental research tapers off. But sometimes there are lingering puzzles that nobody finished solving.”
Reference: “Spontaneous assembly of condensate networks during the demixing of structured fluids” by Yuma Morimitsu, Christopher A. Browne, Zhe Liu, Paul G. Severino, Manesh Gopinadhan, Eric B. Sirota, Ozcan Altintas, Kazem V. Edmond and Chinedum O. Osuji, 13 September 2024, Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.2407914121
This study was conducted at the University of Pennsylvania, in the School of Engineering and Applied Science’s Department of Chemical and Biomolecular Engineering and the School of Arts & Sciences’ Department of Physics and Astronomy, and ExxonMobil’s Research Division. The work was supported by a grant from ExxonMobil and by the U.S. National Science Foundation (DMR-2309043).
Title: Towards a Realistic Long-Term Benchmark for Open-Web Research Agents
Abstract: We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate agents on real-world "messy" open-web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. On average, LLM agents powered by Claude-3.5 Sonnet and o1-preview substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations. Our evaluation represents the first in-depth assessment of agents' abilities to conduct challenging, economically valuable analyst-style research on the real open web.
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Google Scholar provides a simple way to broadly search for scholarly literature. Search across a wide variety of disciplines and sources: articles, theses, books, abstracts and court opinions.
This article contains a representative list of notable databases and search engines useful in an academic setting for finding and accessing articles in academic journals, institutional repositories, archives, or other collections of scientific and other articles. Databases and search engines differ substantially in terms of coverage and retrieval qualities. [1]
IEEE Xplore: an academic database specifically for engineering and computer science. 6. ScienceDirect. ScienceDirect is the gateway to the millions of academic articles published by Elsevier, 1.4 million of which are open access. Journals and books can be searched via a single interface.
Search all biomedical databases provided by the National Center for Biotechnology Information (NCBI), an agency of the U.S. National Library of Medicine at the NIH ... life science journals, and online books. ... joined several senior leaders and scientists from the National Cancer Institute to discuss advances in childhood cancer research, and ...
Harness the power of visual materials—explore more than 3 million images now on JSTOR. Enhance your scholarly research with underground newspapers, magazines, and journals. Take your research further with Artstor's 3+ million images. Explore collections in the arts, sciences, and literature from the world's leading museums, archives, and ...
PubMed® comprises more than 37 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full text content from PubMed Central and publisher web sites. ... MeSH Database Journals Trending Articles PubMed records with recent increases in activity Global, regional, and ...
3.3 million articles on ScienceDirect are open access. Articles published open access are peer-reviewed and made freely available for everyone to read, download and reuse in line with the user license displayed on the article. ScienceDirect is the world's leading source for scientific, technical, and medical research.
Web of Science is a leading scientific research platform offering comprehensive data, metrics, and insights across disciplines.
Advanced. Journal List. PubMed Central ® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM)
Take a look at our compilation of academic research databases: Scopus, Web of Science, PubMed, ERIC, JSTOR, DOAJ, Science Direct, and IEEE Xplore. The best academic search engines [Update 2024] Your research is stuck, and you need to find new sources. Take a look at our compilation of free academic search engines: Google Scholar BASE CORE ...
Access 160+ million publications and connect with 25+ million researchers. Join for free and gain visibility by uploading your research.
NCBI is streamlining the terminology around our reference genomes. We currently have a small set of genomes collectively called representatives and an even smaller set called references. We have slowly converged on the term reference to refer to both sets. A genome is labeled reference if it is deemed to be the best available genome ….
Contains over 170,000 documents in fifteen scientific disciplines. Smithsonian Libraries and Archives offers staff and visitors access to many scientific databases, including Zoological Abstracts, Anthrosource, and Web of Science. A complete listing of these databases is located on the Libraries' E-journals, E-books, and Databases.
What is Database Search? Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more. The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places.
Find the research you need | With 160+ million publication pages, 1+ million questions, and 25+ million researchers, this is where everyone can access science
About the directory. DOAJ is a unique and extensive index of diverse open access journals from around the world, driven by a growing community, and is committed to ensuring quality content is freely available online for everyone. DOAJ is committed to keeping its services free of charge, including being indexed, and its data freely available.
Open Data is a strategy for incorporating research data into the permanent scientific record by releasing it under an Open Access license. Whether data is deposited in a purpose-built repository or published as Supporting Information alongside a research article, Open Data practices ensure that data remains accessible and discoverable. For ...
Library, Information Science & Technology Abstracts (LISTA) is a free research database for library and information science studies. LISTA provides indexing and abstracting for hundreds of key journals, books, research reports. It is EBSCO's intention to provide access to this resource on a continual basis. Access now.
Find support. Find answers to questions about products, access, setup, and administration. Visit the support center. ProQuest powers research in academic, corporate, government, public and school libraries around the world with unique content. Explore millions of resources from scholarly journals, books, newspapers, videos and more.
Browse, search, and explore journals indexed in the Web of Science. The Master Journal List is an invaluable tool to help you to find the right journal for your needs across multiple indices hosted on the Web of Science platform. Spanning all disciplines and regions, Web of Science Core Collection is at the heart of the Web of Science platform. Curated with care by an expert team of in-house ...
Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses ...
Inspec was created by the Institution of Engineering and Technology (IET), and is one of the world's most definitive bibliographic scientific engineering research databases, containing over 15 million abstracts and indexing records. Inspec is on the Engineering Village platform and can be searched together with Ei Compendex.
It is intended to simplify the process of retrieving online scientific data for a specific element. ... and fire service personnel. It includes FIREDOC which is a complete fire research bibliographic database with 55,000 holdings (published reports, journal articles, conference proceedings, books, and audiovisual items) of FRIS. Each reference ...
After Science brought initial concerns about Masliah's work to their attention, a neuroscientist and forensic analysts specializing in scientific work who had previously worked with Science produced a 300-page dossier revealing a steady stream of suspect images between 1997 and 2023 in 132 of his published research papers. (Science did not ...
This summer, students in Duke University's Climate+ program used data science techniques to research climate challenges and potential solutions. They studied topics like saltwater intrusion, energy materials, rainfall predictions and links between climate and health. More than 30 students participated on eight project teams.
Researchers are finding new benefits and reserves of participants by accessing data from unconventional sources, such as market-research panels, administrative systems, and GPS data. These sources can provide much larger and more diverse information than many traditional data sources, but they also come with caveats and ethical standards to be used effectively.
An innovative discovery has unveiled a promising microorganism for CH₄ mitigation known as cable bacteria. Scholz et al. (2020) found that CH₄ emission was reduced by 93% in rice-vegetated soil pots inoculated with cable bacteria enrichment culture compared to the control, suggesting the potential of cable bacteria in CH₄ mitigation. Cable bacteria are multicellular filaments discovered widely in ...