Using Language Models to Classify Innovation and Extract Structured Information about Product Innovation from Unstructured News Stories

Report
Authors:
Kattampallil, Neil, PV-BII SDAD, University of Virginia (ORCID: orcid.org/0000-0002-4092-1897)
Ratcliff, Nathaniel, PV-BII SDAD
Korkmaz, Gizem, PV-BII SDAD
Wang, Alan, PV-BII-Biocomplexity Initiative, University of Virginia (ORCID: orcid.org/0000-0001-6926-4336)
Zhou, Steve, PV-BII-Biocomplexity Initiative, University of Virginia
Anderson, Gary, NSF
Jankowski, John
Abstract:

Innovation, the availability and usage of novel products and business practices, is central to improving living standards. Policymakers, in part, rely on survey-based measures of innovation to design, develop, and implement policies to promote innovation. In the U.S., the National Center for Science and Engineering Statistics (NCSES) measures innovation through nationally representative surveys of businesses, such as the Annual Business Survey (ABS). To reduce respondent fatigue and to provide more timely information, statistical organizations are interested in exploring non-traditional methods for measuring innovation to supplement existing data.

This technical report documents research demonstrating how a large corpus of opportunity data, in particular news articles, can be combined with advanced natural language processing methods to identify and measure innovation in three sectors (food and beverage, pharmaceutical, and computer software). We present a novel approach utilizing the Bidirectional Encoder Representations from Transformers (BERT) language model developed by Google. Our methods include (i) text classification to identify news articles that mention innovation, (ii) named-entity recognition (NER), (iii) question answering (QA) to extract company names, and (iv) developing yearly innovation indicators for companies in these sectors.

Keywords:
BERT, Natural Language Processing (NLP), Text Processing, Innovation
Contributor:
Lyman, Kimberly, PV-BII SDAD, University of Virginia
Language:
English
Source Citation:

Kattampallil T, Ratcliff N, Korkmaz G, Wang A, Zhou S, Anderson G, Jankowski J, (2023). Using Language Models to Classify Innovation and Extract Structured Information about Product Innovation from Unstructured News Stories, Proceedings of the Biocomplexity Institute, Technical Report. TR# BI-2023-5, University of Virginia. https://doi.org/10.18130/hatt-jk83.

Publisher:
University of Virginia
Published Date:
January 24, 2023
Sponsoring Agency:
National Science Foundation, National Center for Science and Engineering Statistics