Large Language Models (LLMs) operate as "black boxes," limiting interpretability. Our project creates a user-friendly GUI for Concept Bottleneck Layers (CBLs) that connects model outputs to human-understandable concepts. This platform allows users of all technical backgrounds to integrate CBLs with pre-trained LLMs, visualize concept activations, analyze concept contributions, and prune biased concepts—all while maintaining model performance. Building on Weng et al. (2024), we bridge the gap between advanced interpretability techniques and practical applications through an accessible interface and robust architecture.
Figure 1: Comparison Between Traditional Black Box LLMs and Concept Bottleneck Models
Large Language Models (LLMs) excel at many tasks but function as "black boxes," raising concerns about trust and accountability. Concept Bottleneck Models (CBMs) address this by inserting human-interpretable concepts as an intermediate layer, revealing which concepts influence a model's decisions.
Our project builds on Weng et al. (2024) to create an intuitive GUI platform that democratizes access to CBL technology, allowing users of all technical backgrounds to leverage these powerful interpretability tools.
Figure 2: CBM-GUI Workflow: A comprehensive platform that enables users to train CBL models, visualize concept activations, identify biased concepts, and prune them to enhance model fairness and interpretability.
Our application has evolved well beyond the Minimum Viable Product (MVP) stage into a production-ready system with a scalable microservices architecture, persistent database storage, and a decoupled backend and frontend.
Our system implements Concept Bottleneck Models (CBMs) that introduce interpretable layers to traditional "black-box" language models by mapping model representations to human-understandable concepts.
For a text sample \(x\) and concept set \(C = \{c_1, c_2, \ldots, c_k\}\), we calculate concept scores as:
\begin{equation} S_c(x) = [E(c_1) \cdot E(x), E(c_2) \cdot E(x), \ldots, E(c_k) \cdot E(x)]^T \end{equation}
where \(E(x)\) is the text embedding from our sentence embedding model and \(E(c_i)\) is the embedding of concept \(c_i\).
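A minimal sketch of this scoring step, assuming a sentence-transformers encoder; the model name, function name, and example concepts are illustrative rather than the platform's exact implementation:

```python
import torch
from sentence_transformers import SentenceTransformer

# Illustrative encoder choice; the platform's actual sentence embedding model may differ.
encoder = SentenceTransformer("all-mpnet-base-v2")

def concept_scores(text: str, concepts: list[str]) -> torch.Tensor:
    """Compute S_c(x): the dot product of the text embedding with each concept embedding."""
    e_x = torch.tensor(encoder.encode(text))        # E(x), shape (d,)
    e_c = torch.tensor(encoder.encode(concepts))    # [E(c_1), ..., E(c_k)], shape (k, d)
    return e_c @ e_x                                # S_c(x), shape (k,)

# Example usage with hypothetical concepts for a sentiment task.
scores = concept_scores("The acting was superb.",
                        ["positive sentiment", "negative sentiment", "acting quality"])
```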
We optimize the language model and CBL parameters so that the CBL activations align with the concept scores:
\begin{equation} \max_{\theta_1,\theta_2} \frac{1}{|D|} \sum_{x \in D} \text{Sim}\big(f_{CBL}(f_{LM}(x; \theta_1); \theta_2), S_c(x)\big) \end{equation}
where \(f_{LM}\) is the pretrained language model with parameters \(\theta_1\) and \(f_{CBL}\) is the concept bottleneck layer with parameters \(\theta_2\).
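The sketch below illustrates one training step under this objective, using cosine similarity as an illustrative choice of Sim; the class and function names, and the way the backbone is wired in, are assumptions rather than the platform's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckLayer(nn.Module):
    """Maps backbone features to k concept activations A_N(x)."""
    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_concepts)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

def cbl_training_step(backbone, cbl, optimizer, inputs, concept_scores):
    """One step that pushes CBL activations toward the target concept scores S_c(x)."""
    features = backbone(inputs)            # f_LM(x; theta_1)
    activations = cbl(features)            # f_CBL(.; theta_2)
    # Maximizing Sim is implemented as minimizing negative cosine similarity (illustrative).
    loss = -F.cosine_similarity(activations, concept_scores, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```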
The final layer is trained with the following objective:
\begin{equation} \min_{W,b} \frac{1}{|D|} \sum_{(x,y) \in D} L_{CE}(W \cdot A^+_N(x) + b, y) + \lambda R(W) \end{equation}
where \(A^+_N(x) = \text{ReLU}(A_N(x))\) denotes the non-negative activations from the CBL and the regularization term \(R(W) = \alpha\|W\|_1 + (1-\alpha)\frac{1}{2}\|W\|^2_2\) combines L1 and L2 penalties.
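A sketch of this loss in PyTorch; the default values of \(\lambda\) and \(\alpha\) are chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

def final_layer_loss(W: torch.Tensor, b: torch.Tensor,
                     activations: torch.Tensor, labels: torch.Tensor,
                     lam: float = 1e-4, alpha: float = 0.99) -> torch.Tensor:
    """Cross-entropy on ReLU'd concept activations plus an elastic-net penalty on W."""
    a_plus = F.relu(activations)           # A^+_N(x), shape (batch, num_concepts)
    logits = a_plus @ W.T + b              # W · A^+_N(x) + b, shape (batch, num_classes)
    ce = F.cross_entropy(logits, labels)
    reg = alpha * W.abs().sum() + (1 - alpha) * 0.5 * (W ** 2).sum()
    return ce + lam * reg
```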
For any text sample x, the contribution of concept i to class j is calculated as:
\begin{equation} \text{contribution}_{j,i} = a_i \times W_{j,i} \end{equation}
where \(a_i\) is the activation of concept \(i\) and \(W_{j,i}\) is the weight connecting concept \(i\) to class \(j\).
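In code this is an element-wise product between the activation vector and the corresponding row of the final-layer weight matrix; the function name below is assumed:

```python
import torch

def concept_contributions(a: torch.Tensor, W: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Element-wise contributions a_i * W_{j,i} of every concept i to class j = class_idx."""
    return a * W[class_idx]                # shape (num_concepts,)

# Example: surface the ten most influential concepts for the predicted class.
# top_values, top_indices = torch.topk(concept_contributions(a, W, predicted_class), k=10)
```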
To remove a biased concept i, we zero its weights:
\begin{equation} W_{j,i} = 0 \quad \forall j \in \{1, 2, \ldots, \text{num\_classes}\} \end{equation}
This ensures that, regardless of the concept's activation, its contribution is zero:
\begin{equation} \text{contribution}_{j,i} = a_i \times 0 = 0 \end{equation}
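Pruning therefore amounts to zeroing one column of the final-layer weight matrix; a minimal sketch, assuming \(W\) has shape (num_classes, num_concepts):

```python
import torch

@torch.no_grad()
def prune_concept(W: torch.Tensor, concept_idx: int) -> None:
    """Zero the weight column of one concept so it cannot contribute to any class."""
    W[:, concept_idx] = 0.0
```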
We enhance concept scores based on class associations:
\begin{equation} S^{ACC}_c(x)_i = \begin{cases} E(c_i) \cdot E(x) & \text{if } E(c_i) \cdot E(x) > 0 \text{ and } M(c_i) = y \\ 0 & \text{otherwise} \end{cases} \end{equation}
where \(M(c_i)\) maps concept \(c_i\) to its associated class.
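A sketch of this filtering step, assuming `concept_to_class` is a tensor holding \(M(c_i)\) for each concept index; the names are illustrative:

```python
import torch

def acc_concept_scores(scores: torch.Tensor, concept_to_class: torch.Tensor, label: int) -> torch.Tensor:
    """Zero any score that is non-positive or whose concept is not associated with the sample's class y."""
    keep = (scores > 0) & (concept_to_class == label)   # E(c_i)·E(x) > 0 and M(c_i) = y
    return torch.where(keep, scores, torch.zeros_like(scores))
```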
Figure 4: The top panel displays the raw activation values of the top 10 concepts. The bottom panel reveals the actual contributions of these concepts to the prediction.
The CBM-GUI platform represents a significant advancement in making LLM interpretability techniques accessible to a broader audience. By providing an intuitive interface for CBL integration, concept visualization, and bias mitigation through concept pruning, the system bridges the gap between theoretical interpretability research and practical applications.
Sun, C.-E., Oikarinen, T., and Weng, T.-W. Crafting large language models for enhanced interpretability. In International Conference on Learning Representations, 2024.
Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. Concept bottleneck models. In International Conference on Machine Learning, 2020.
Yuksekgonul, M., Wang, B., and Zou, J. Post-hoc concept bottleneck models. In International Conference on Machine Learning, 2022.