Using Language Models to Classify Innovation and Extract Structured Information about Product Innovation from Unstructured News Stories
Report
orcid.org/0000-0002-4092-1897 Ratcliff, Nathaniel, PV-BII SDAD; Korkmaz, Gizem, PV-BII SDAD; Wang, Alan, PV-BII-Biocomplexity Initiative, University of Virginia; orcid.org/0000-0001-6926-4336 Zhou, Steve, PV-BII-Biocomplexity Initiative, University of Virginia; Anderson, Gary, NSF; Jankowski, John
Innovation, the availability and usage of novel products and business practices, is central to improving living standards. Policymakers, in part, rely on survey-based measures of innovation to design, develop, and implement policies to promote innovation. In the U.S., the National Center for Science and Engineering Statistics (NCSES) measures innovation through nationally representative surveys of businesses, such as the Annual Business Survey (ABS). To reduce respondent fatigue and to provide more timely information, statistical organizations are interested in exploring non-traditional methods for measuring innovation to supplement existing data.
This technical report documents research demonstrating how a large corpus of opportunity data, in particular news articles, can be combined with advanced natural language processing methods to identify and measure innovation in three sectors (food and beverage, pharmaceutical, and computer software). We present a novel approach utilizing the Bidirectional Encoder Representations from Transformers (BERT) language model developed by Google. Our methods include (i) text classification to identify news articles that mention innovation, (ii) named-entity recognition (NER), (iii) question answering (QA) to extract company names, and (iv) development of yearly innovation indicators for companies in these sectors.
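As a rough illustration of step (iv), the sketch below aggregates per-article results into yearly counts per company. It assumes the upstream steps have already attached an innovation label and an extracted company name to each article. The `Article` record, its field names, the sample data, and the use of a simple article count as the indicator are all hypothetical choices made for this example, not the report's actual indicator definition.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical per-article record produced by the upstream steps."""
    company: str         # company name, assumed extracted via NER/QA
    year: int            # publication year of the news article
    is_innovation: bool  # assumed output of the text classifier


def yearly_innovation_counts(articles):
    """Count innovation-labeled articles for each (company, year) pair."""
    counts = Counter()
    for article in articles:
        if article.is_innovation:
            counts[(article.company, article.year)] += 1
    return dict(counts)


# Made-up sample data, for illustration only.
sample_articles = [
    Article("Acme Foods", 2015, True),
    Article("Acme Foods", 2015, True),
    Article("Acme Foods", 2015, False),
    Article("Acme Foods", 2016, True),
    Article("Example Pharma", 2015, True),
]

indicators = yearly_innovation_counts(sample_articles)
```

Keying the counts by `(company, year)` keeps the aggregation independent of how the labels and names were produced, so a different classifier or extraction method could be swapped in without changing this step.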
BERT, Natural Language Processing (NLP), Text Processing, Innovation
English
Kattampallil T, Ratcliff N, Korkmaz G, Wang A, Zhou S, Anderson G, Jankowski J (2023). Using Language Models to Classify Innovation and Extract Structured Information about Product Innovation from Unstructured News Stories, Proceedings of the Biocomplexity Institute, Technical Report. TR# BI-2023-5, University of Virginia. https://doi.org/10.18130/hatt-jk83.
University of Virginia
January 24, 2023
National Science Foundation, National Center for Science and Engineering Statistics