Resources

Awesome Scholarly Data Analysis: A curated collection of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources.

Semantic Scholar Academic Graph API: The RESTful Semantic Scholar Academic Graph (S2AG) API is a reliable on-demand source of data about authors, papers, citations, venues, and more. Currently the S2AG API supports Paper and Author Lookup, Conflict of Interest detection, Conference Reviewer Match, SPECTER embeddings, and SUPP.AI annotations.
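
As a rough illustration of the Paper Lookup mentioned above, the following sketch builds a request URL for the public Graph API and, optionally, fetches the JSON record. The helper names (`paper_lookup_url`, `fetch_paper`), the default field list, and the placeholder identifier are assumptions made for this example, not part of the API; unauthenticated requests may be rate limited.

```python
import json
import urllib.parse
import urllib.request

S2AG_BASE = "https://api.semanticscholar.org/graph/v1"


def paper_lookup_url(paper_id, fields=("title", "year", "authors")):
    """Build a Paper Lookup URL (hypothetical helper for illustration)."""
    # Keep commas readable in the comma-separated `fields` parameter.
    query = urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    # Identifiers such as "DOI:..." contain ':' and '/', so leave those unescaped.
    quoted_id = urllib.parse.quote(paper_id, safe=":/")
    return f"{S2AG_BASE}/paper/{quoted_id}?{query}"


def fetch_paper(paper_id, fields=("title", "year", "authors"), timeout=10):
    """Fetch one paper record as a dict. Performs a network request."""
    url = paper_lookup_url(paper_id, fields)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (placeholder identifier, replace with a real DOI or paper ID):
# record = fetch_paper("DOI:10.0000/example")
# print(record.get("title"), record.get("year"))
```

Only the fields named in the `fields` parameter are requested, which keeps responses small when just a title or year is needed.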

OpenAlex: An open and comprehensive catalog of scholarly papers, authors, institutions, and more. Inspired by the ancient Library of Alexandria, OpenAlex is an index of hundreds of millions of interconnected entities across the global research system. It is 100% free and open source, and offers access via a web interface, API, and database snapshot.
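
For the API access route, a minimal sketch of a full-text search over OpenAlex works might look like the following. The helper names (`works_search_url`, `search_works`) and the example query are assumptions for illustration; the code uses only the Python standard library and the public `works` endpoint.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_BASE = "https://api.openalex.org"


def works_search_url(query, per_page=5):
    """Build a search URL for the works endpoint (hypothetical helper)."""
    params = urllib.parse.urlencode({"search": query, "per-page": per_page})
    return f"{OPENALEX_BASE}/works?{params}"


def search_works(query, per_page=5, timeout=10):
    """Return (title, year) pairs for matching works. Performs a network request."""
    with urllib.request.urlopen(works_search_url(query, per_page), timeout=timeout) as resp:
        payload = json.load(resp)
    # Each entry in "results" is one work; missing keys come back as None.
    return [
        (work.get("display_name"), work.get("publication_year"))
        for work in payload.get("results", [])
    ]


# Example:
# for title, year in search_works("citation intent classification"):
#     print(year, title)
```

For bulk analysis, the database snapshot is likely a better fit than paging through the API.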

CZ Software Mentions: A large dataset of software mentions in the biomedical literature. Also see the accompanying Medium article.

AI2 Meaningful Citations Data Set: This dataset consists of annotations for 465 computer science papers. The annotations indicate whether a citation is important (i.e., refers to ongoing or continued work on the relevant topic) and assign each citation one of four importance rankings.

SciFact: Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims. To facilitate the development of systems for this task, we introduce SciFact, a dataset of 1.4K expert-written claims, paired with evidence-containing abstracts annotated with veracity labels and rationales.

S2ORC: The Semantic Scholar Open Research Corpus: A large corpus of 81.1M English-language academic papers spanning many academic disciplines. It provides rich metadata, paper abstracts, and resolved bibliographic references, as well as structured full text for 8.1M open access papers. The full text is annotated with automatically detected inline mentions of citations, figures, and tables, each linked to its corresponding paper object. S2ORC aggregates papers from hundreds of academic publishers and digital archives into a unified source, creating the largest publicly available collection of machine-readable academic text to date.

SciCite: Citation intent classification dataset: Citations play a unique role in scientific discourse and are crucial for understanding and analyzing scientific work. However, not all citations are equal. Some citations refer to the use of a method from another work, some discuss results or findings of other work, while others are merely background or acknowledgement citations. SciCite is a dataset of 11K manually annotated citation intents based on citation context in the computer science and biomedical domains.

Science Terms and Sentences: The dataset contains 9,356 science terms and, for each term, an average of 16,000 sentences that contain the term.

Qasper: A dataset containing 1,585 papers with 5,049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.

GROBID: GROBID is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications.
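
To give a feel for working with XML/TEI output, the sketch below reads a title and section headings from a TEI document using the Python standard library. The `SAMPLE_TEI` string is a small hand-written stand-in, not real GROBID output, and the helper names are assumptions for illustration; actual GROBID documents are considerably richer.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written stand-in for a TEI document (assumption, for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def tei_title(root):
    """Return the main title text, or None if the header has no title."""
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return node.text.strip() if node is not None and node.text else None


def tei_section_heads(root):
    """Return the heading text of every <div> that has a <head> child."""
    return [h.text.strip() for h in root.findall(".//tei:div/tei:head", TEI_NS) if h.text]


root = ET.fromstring(SAMPLE_TEI)
print(tei_title(root))
print(tei_section_heads(root))
```

Passing the namespace map explicitly avoids the common pitfall of searching for unqualified tag names, which match nothing in a namespaced TEI file.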

doc2json: Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON).

DIY Information Extraction: Tools for information extraction from text.

For additional datasets from AI2 see allenai.org/data/.