InSaAF: Incorporating Safety Through Accuracy and Fairness

Centre for Responsible AI, IIT Madras | Precog, IIIT Hyderabad | AmexAI Labs
*Indicates Equal Contribution

The Problem

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including the legal sector. But are they really ready for deployment, especially in the Indian legal domain? Or do they exhibit biases?

LLaMA Bias Prediction

LLaMA predicts different outputs for prompts that differ only in the identity of the individual (Christian vs. Hindu). Deploying such LLMs in the real world may lead to biased and unfavourable outcomes.

Abstract

Large Language Models (LLMs) have emerged as powerful tools to perform various tasks in the legal domain, ranging from generating summaries to predicting judgments. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. Hence, it is essential to evaluate these models prior to deployment. In this study, we explore the ability of LLMs to perform Binary Statutory Reasoning in the Indian legal landscape across various societal disparities. We present a novel metric, β-weighted Legal Safety Score (LSSβ), to evaluate the legal usability of the LLMs. Additionally, we propose a finetuning pipeline, utilising specialised legal datasets, as a potential method to reduce bias. Our proposed pipeline effectively reduces bias in the model, as indicated by improved LSSβ. This highlights the potential of our approach to enhance fairness in LLMs, making them more reliable for legal tasks in socially diverse contexts.

Methodology

The proposed work is divided into three components:

  1. Construction of a synthetic dataset
  2. Quantifying the usability of LLMs in the Indian legal domain through the lens of the fairness-accuracy tradeoff
  3. Bias mitigation by finetuning the LLM

Finetuning Pipeline

Our proposed finetuning pipeline. The Vanilla LLM is finetuned with two sets of prompts: with and without identity terms. The baseline dataset ensures that the model's natural language generation abilities remain intact. After finetuning, each model is evaluated on the test dataset against the LSS metric.

1. Dataset Construction

We created a synthetic dataset for Binary Statutory Reasoning (BSR), the task of determining whether a given law applies to a described situation. The dataset includes (a construction sketch follows the list):

  • 1500 samples for each identity type
  • 74K prompt instances in total
  • 7% of samples labelled "YES" (the law applies)
  • BSR-with-ID: dataset with identity information
  • BSR-without-ID: auxiliary dataset with identity terms removed
  • BSR-Test-with-ID: test dataset with identity terms
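
To make the construction concrete, below is a minimal sketch of how identity-conditioned BSR prompt instances could be assembled from templates. The identity terms, situation and statute texts, labels, and field names are illustrative placeholders, not the actual dataset contents.

```python
# Minimal sketch of template-based BSR prompt construction.
# All identity terms, templates, and labels below are illustrative placeholders.
from itertools import product

identities = ["a Hindu man", "a Christian man", "a Muslim woman"]  # example identity terms
situations = [
    # (situation template with an {identity} slot, statute text, gold label)
    ("{identity} is denied entry to a public restaurant because of their religion.",
     "Article 15 prohibits discrimination on grounds of religion, race, caste, sex or place of birth.",
     "YES"),
    ("{identity} parks a car in a no-parking zone.",
     "Article 15 prohibits discrimination on grounds of religion, race, caste, sex or place of birth.",
     "NO"),
]

def make_prompt(identity: str, situation: str, statute: str) -> str:
    """Build one Binary Statutory Reasoning prompt: does the statute apply?"""
    return (
        f"Situation: {situation.format(identity=identity)}\n"
        f"Law: {statute}\n"
        "Question: Does the law apply to this situation? Answer YES or NO."
    )

# BSR-with-ID keeps the identity in the prompt; BSR-without-ID neutralises the identity slot.
bsr_with_id = [
    {"prompt": make_prompt(idt, situ, law), "label": lbl}
    for idt, (situ, law, lbl) in product(identities, situations)
]
bsr_without_id = [
    {"prompt": make_prompt("a person", situ, law), "label": lbl}
    for (situ, law, lbl) in situations
]
```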

2. Legal Safety Score (LSS)

We introduced a novel metric to evaluate LLMs in the legal domain:

  • Relative Fairness Score (RFS): Measures the proportion of samples for which the LLM gives the same prediction regardless of the individual's identity
  • F1 Score: Measures the task accuracy of the predictions
  • β-weighted Legal Safety Score (LSSβ): Combines RFS and F1 score to quantify usability

The formula: LSSβ = (1 + β²) × (RFS × F1) / (RFS + β² × F1)
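
For concreteness, here is a minimal sketch of the metric computation. The function names and the example RFS/F1 values are our own illustrative choices, not numbers from the paper.

```python
def relative_fairness_score(preds_by_situation: list[list[str]]) -> float:
    """RFS: fraction of situations whose predictions agree across all identity variants."""
    agree = sum(len(set(variants)) == 1 for variants in preds_by_situation)
    return agree / len(preds_by_situation)

def legal_safety_score(rfs: float, f1: float, beta: float = 1.0) -> float:
    """Beta-weighted LSS, per the formula above: beta < 1 emphasises F1, beta > 1 emphasises RFS."""
    if rfs == 0.0 and f1 == 0.0:
        return 0.0
    return (1 + beta**2) * (rfs * f1) / (rfs + beta**2 * f1)

# Illustrative values only: as beta grows, the score is pulled toward the RFS.
rfs, f1 = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: LSS = {legal_safety_score(rfs, f1, beta):.3f}")
```

With these illustrative numbers, LSSβ moves from about 0.64 at β = 0.5 (closer to F1) to about 0.82 at β = 2 (closer to RFS), mirroring findings 3 and 4 in the Key Findings below.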

3. Finetuning for Bias Mitigation

We studied three LLM variants (a finetuning sketch follows this list):

  • LLM-Vanilla: Original model (baseline)
  • LLM-with-ID: Finetuned on the BSR-with-ID dataset
  • LLM-without-ID: Finetuned on the BSR-without-ID dataset (inspired by Rawls' Veil of Ignorance theory)
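
As a rough sketch, one finetuned variant (e.g. LLM-with-ID) could be produced with LoRA via Hugging Face transformers and peft roughly as below. The checkpoint name, LoRA hyperparameters, data file, and prompt formatting are assumptions for illustration, not the exact training recipe used in the paper.

```python
# Hedged sketch: LoRA finetuning of a LLaMA-family model on a BSR-style dataset.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],          # typical LLaMA attention projections
))

# Assume a JSONL file with one BSR sample per line, holding "prompt" and "label" fields.
data = load_dataset("json", data_files="bsr_with_id.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["prompt"] + " " + ex["label"],
                                     truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm_with_id", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4, fp16=True,
                           logging_steps=50, save_strategy="epoch"),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same pipeline applied to bsr_without_id.jsonl would yield the LLM-without-ID variant.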

Results & Discussion

Experimental Setup

We evaluated multiple variants of Meta's LLaMA models:

  • LLaMA 7B
  • LLaMA-2 7B
  • LLaMA-3.1 8B

Models were finetuned using Low-Rank Adaptation (LoRA) on an A100 80GB GPU with float16 precision. We monitored validation loss on the Penn Treebank to guard against catastrophic forgetting.
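
To illustrate the catastrophic-forgetting check, the snippet below estimates validation loss and perplexity on Penn Treebank text for a finetuned checkpoint. The dataset identifier ("ptb_text_only"), checkpoint path, window size, and stride are assumptions for this sketch, not the paper's exact evaluation setup.

```python
# Hedged sketch: track Penn Treebank validation loss after finetuning,
# to check that general language-modelling ability has not degraded.
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# "llm_with_id" is a placeholder path to a finetuned checkpoint (recent transformers
# versions can load PEFT adapter directories directly).
model = AutoModelForCausalLM.from_pretrained("llm_with_id", torch_dtype=torch.float16,
                                             device_map="auto").eval()

ptb = load_dataset("ptb_text_only", "penn_treebank", split="validation")
enc = tokenizer("\n".join(ptb["sentence"]), return_tensors="pt")

losses, max_len, stride = [], 1024, 512
for start in range(0, enc.input_ids.size(1) - max_len, stride):
    ids = enc.input_ids[:, start:start + max_len].to(model.device)
    with torch.no_grad():
        out = model(ids, labels=ids)          # mean cross-entropy over the window
    losses.append(out.loss.item())

val_loss = sum(losses) / len(losses)
print(f"PTB validation loss: {val_loss:.3f}, perplexity: {math.exp(val_loss):.1f}")
```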

Key Findings

  1. Our finetuning strategy progressively increased the LSS for all LLaMA models
  2. The vanilla LLaMA-3.1 model showed a significantly higher LSS than the other models, and its LSS improved further after finetuning
  3. When β < 1, LSSβ is primarily controlled by the F1 score
  4. As β increases, LSSβ becomes dominated by the RFS values

LSS Trends

Trends of F1 score, RFS, and LSS across various finetuning checkpoints for the LLaMA models. We observe that the LSS progressively increases with finetuning. The variation shows that LSS takes into account both the RFS and F1 score. The Vanilla LLM corresponds to checkpoint 0, marked separately by ◦.

Conclusion & Future Work

Our research examines bias, fairness, and task performance of LLMs in the Indian legal domain, introducing the β-weighted Legal Safety Score (LSSβ) to jointly quantify fairness and task performance. Finetuning with custom legal datasets improves the LSS, making models more suitable for legal contexts.

While our findings provide valuable insights, further research is needed to:

  • Address recent case histories and legal precedents
  • Conduct deeper social group analysis
  • Expand beyond Binary Statutory Reasoning to more complex legal tasks

Our work is a preliminary step toward safer LLM use in the legal field, particularly in socially diverse contexts like India.

BibTeX

@inbook{Tripathi2024,
  title = {InSaAF: Incorporating Safety Through Accuracy and Fairness - Are LLMs Ready for the Indian Legal Domain?},
  ISBN = {9781643685625},
  ISSN = {1879-8314},
  url = {http://dx.doi.org/10.3233/FAIA241266},
  DOI = {10.3233/faia241266},
  booktitle = {Legal Knowledge and Information Systems},
  publisher = {IOS Press},
  author = {Tripathi, Yogesh and Donakanti, Raghav and Girhepuje, Sahil and Kavathekar, Ishan and Vedula, Bhaskara Hanuma and Krishnan, Gokul S. and Goel, Anmol and Goyal, Shreya and Ravindran, Balaraman and Kumaraguru, Ponnurangam},
  year = {2024},
  month = dec 
}