Fine-Tuning Pre-Trained Large Language Models to Identify Jim Crow Laws in Virginia

Poster
Author: Odukoya, Tolu, Department: UVA Library; UVA Law Library, University of Virginia, ORCID: orcid.org/0009-0008-3662-3046
Abstract:

Can machine learning identify Jim Crow laws within other laws passed in a state? The Modeling a Racial Caste System (MRCS) project at the University of Virginia is a collections as data and machine learning project made possible by a sub-award from the University of North Carolina at Chapel Hill Libraries' project, On the Books: Jim Crow and Algorithms of Resistance, which was funded by the Mellon Foundation. The team created a plain text corpus of Virginia session laws and utilized machine learning techniques to discover Jim Crow laws passed between Reconstruction and the Civil Rights Movement (1865-1967). Disciplinary scholars on the MRCS team created a training set of laws labeled as either “Jim Crow” or “Not Jim Crow.”

This poster highlights the project's achievements, including the creation of the first Large Language Model (LLM) fine-tuned on Jim Crow racism-related and legislative language for text classification, built on the DistilBERT model. The fine-tuned DistilBERT model achieved accuracy and F1 scores of 0.99, a substantial improvement over the earlier model's accuracy of 0.87. Because chapter lengths are inconsistent, the corpus of Virginia laws was split from the chapter level to the sentence level for analysis. The fine-tuned model classified 30,814 of the corpus's 446,000 sentences as Jim Crow. A deduplication program reduced these 30,814 predicted sentences to 13,533. Disciplinary scholars on the MRCS team are currently reviewing the 13,533 sentences manually and intend to aggregate the results back to the chapter level. The project will publish the data on the UVA Law Library website, providing users with two corpora: one of all Virginia laws passed during the period of study and a second of the Jim Crow laws identified by the model and confirmed by scholars.
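As an illustration of the deduplication and chapter-level aggregation steps described above, the following is a minimal Python sketch and not the project's actual program. It rests on assumptions not stated in the abstract: that duplicates are matched after lowercasing and stripping punctuation and extra whitespace, that each predicted sentence carries a chapter identifier, and that a chapter is flagged when at least one of its sentences is confirmed by reviewers. The function names (normalize, dedupe, flag_chapters) and the chapter identifier format are hypothetical.

```python
import re
from collections import defaultdict


def normalize(sentence):
    """Assumed matching rule: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", sentence.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe(predictions):
    """Collapse repeated sentences while remembering every chapter they came from.

    predictions: iterable of (chapter_id, sentence) pairs predicted as Jim Crow.
    Returns (unique, chapters_by_key), where unique maps a normalized key to the
    first sentence seen with that key, and chapters_by_key maps the same key to
    the set of chapter ids in which it occurred.
    """
    unique = {}
    chapters_by_key = defaultdict(set)
    for chapter_id, sentence in predictions:
        key = normalize(sentence)
        chapters_by_key[key].add(chapter_id)
        unique.setdefault(key, sentence)  # keep only the first occurrence
    return unique, chapters_by_key


def flag_chapters(confirmed_keys, chapters_by_key):
    """Assumed aggregation rule: flag a chapter if any confirmed sentence is in it."""
    flagged = set()
    for key in confirmed_keys:
        flagged |= chapters_by_key.get(key, set())
    return flagged
```

Under these assumptions, reviewers only need to read each unique sentence once, and a confirmed label is carried back to every chapter in which that sentence appeared.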

Keywords:
Jim Crow, Fine-Tuned Large Language Model, Racism in the Reconstruction and the Civil Rights Movement Era, Machine Learning, Jim Crow Laws Passed in Virginia (1865 – 1967), Text Classification, Legislation, Legal Language Classification and Analysis with Machine Learning, Fine-Tuned DistilBERT, Corpus of Virginia Laws
Language:
English
Publisher:
University of Virginia
Published Date:
April 23, 2024
Sponsoring Agency:
The Mellon Foundation Sub-Award From The University of North Carolina at Chapel Hill Libraries
Notes:

This poster won an Honorable Mention and a $750 travel award at the Research Computing Exhibition at UVA on April 24, 2024.

I want to give special thanks to the Mellon Foundation, the University of North Carolina at Chapel Hill Libraries, the University of Virginia Library, and the Law Library for creating this phenomenal project and opportunity.

Special thanks to the entire University of Virginia Modeling a Racial Caste System (MRCS) project team for their continued encouragement and for allowing me to learn and lead the technical analysis for the project.

Special thanks to Marcus Bobar for advising me throughout the technical analysis and for providing impeccable feedback on the poster draft.

Special thanks to Amanda Henley for her continued support, indispensable feedback, and edits on the poster draft.

Special thanks to Jennifer Huck for her continued encouragement, support, and feedback on the poster and throughout this process.

Lastly, special thanks to the University of Virginia Research Computing office; without their resources, the technical analysis and model fine-tuning would have been impossible.