Large Language Models (LLMs) operate as "black boxes," limiting interpretability. Our project creates a user-friendly GUI for Concept Bottleneck Layers (CBLs) that connects model outputs to human-understandable concepts. This platform allows users of all technical backgrounds to integrate CBLs with pre-trained LLMs, visualize concept activations, analyze concept contributions, and prune biased concepts—all while maintaining model performance. Building on Weng et al. (2024), we bridge the gap between advanced interpretability techniques and practical applications through an accessible interface and robust architecture.
Figure 1: Comparison Between Traditional Black Box LLMs and Concept Bottleneck Models
Large Language Models (LLMs) excel at many tasks but function as "black boxes," raising concerns about trust and accountability. Concept Bottleneck Models (CBMs) address this by inserting human-interpretable concepts as an intermediate layer, revealing which concepts influence a model's decisions.
Our project builds on Weng et al. (2024) to create an intuitive GUI platform that democratizes access to CBL technology, allowing users of all technical backgrounds to leverage these powerful interpretability tools.
Figure 2: CBM-GUI Workflow: A comprehensive platform that enables users to train CBL models, visualize concept activations, identify biased concepts, and prune them to enhance model fairness and interpretability.
Our application has evolved well beyond the Minimum Viable Product (MVP) stage into a production-ready system with a scalable microservices architecture, persistent database storage, and a decoupled backend and frontend.
Our system implements Concept Bottleneck Models (CBMs) that introduce interpretable layers to traditional "black-box" language models by mapping model representations to human-understandable concepts.
For a text sample \(x\) and concept set \(C = \{c_1, c_2, \ldots, c_k\}\), we calculate concept scores as:
\begin{equation} S_c(x) = [E(c_1) \cdot E(x), E(c_2) \cdot E(x), \ldots, E(c_k) \cdot E(x)]^T \end{equation}
where \(E(x)\) is the text embedding from our sentence embedding model and \(E(c_i)\) is the embedding of concept \(c_i\).
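A minimal sketch of this scoring step, assuming a sentence-transformers encoder; the model name, function name, and example concepts are illustrative rather than the platform's exact implementation:

```python
import torch
from sentence_transformers import SentenceTransformer

# Illustrative encoder choice; the platform's actual sentence embedding model may differ.
encoder = SentenceTransformer("all-mpnet-base-v2")

def concept_scores(text: str, concepts: list[str]) -> torch.Tensor:
    """Compute S_c(x): the dot product of the text embedding with each concept embedding."""
    e_x = torch.tensor(encoder.encode(text))        # E(x), shape (d,)
    e_c = torch.tensor(encoder.encode(concepts))    # [E(c_1), ..., E(c_k)], shape (k, d)
    return e_c @ e_x                                # S_c(x), shape (k,)

# Example usage with hypothetical concepts for a sentiment task.
scores = concept_scores("The acting was superb.",
                        ["positive sentiment", "negative sentiment", "acting quality"])
```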
We optimize the language model and CBL parameters so that the CBL activations align with the concept scores:
\begin{equation} \max_{\theta_1,\theta_2} \frac{1}{|D|} \sum_{x \in D} \text{Sim}\big(f_{CBL}(f_{LM}(x; \theta_1); \theta_2), S_c(x)\big) \end{equation}
where \(f_{LM}\) is the pretrained language model with parameters \(\theta_1\) and \(f_{CBL}\) is the concept bottleneck layer with parameters \(\theta_2\).
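The sketch below illustrates one training step under this objective, using cosine similarity as an illustrative choice of Sim; the class and function names, and the way the backbone is wired in, are assumptions rather than the platform's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckLayer(nn.Module):
    """Maps backbone features to k concept activations A_N(x)."""
    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_concepts)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

def cbl_training_step(backbone, cbl, optimizer, inputs, concept_scores):
    """One step that pushes CBL activations toward the target concept scores S_c(x)."""
    features = backbone(inputs)            # f_LM(x; theta_1)
    activations = cbl(features)            # f_CBL(.; theta_2)
    # Maximizing Sim is implemented as minimizing negative cosine similarity (illustrative).
    loss = -F.cosine_similarity(activations, concept_scores, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```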
The final layer is trained with the following objective:
\begin{equation} \min_{W,b} \frac{1}{|D|} \sum_{(x,y) \in D} L_{CE}(W \cdot A^+_N(x) + b, y) + \lambda R(W) \end{equation}
where \(A^+_N(x) = \text{ReLU}(A_N(x))\) denotes the non-negative activations from the CBL and the regularization term \(R(W) = \alpha\|W\|_1 + (1-\alpha)\frac{1}{2}\|W\|^2_2\) combines L1 and L2 penalties.
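A sketch of this loss in PyTorch; the default values of \(\lambda\) and \(\alpha\) are chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

def final_layer_loss(W: torch.Tensor, b: torch.Tensor,
                     activations: torch.Tensor, labels: torch.Tensor,
                     lam: float = 1e-4, alpha: float = 0.99) -> torch.Tensor:
    """Cross-entropy on ReLU'd concept activations plus an elastic-net penalty on W."""
    a_plus = F.relu(activations)           # A^+_N(x), shape (batch, num_concepts)
    logits = a_plus @ W.T + b              # W · A^+_N(x) + b, shape (batch, num_classes)
    ce = F.cross_entropy(logits, labels)
    reg = alpha * W.abs().sum() + (1 - alpha) * 0.5 * (W ** 2).sum()
    return ce + lam * reg
```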
For any text sample x, the contribution of concept i to class j is calculated as:
\begin{equation} \text{contribution}_{j,i} = a_i \times W_{j,i} \end{equation}
where \(a_i\) is the activation of concept \(i\) and \(W_{j,i}\) is the weight connecting concept \(i\) to class \(j\).
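In code this is an element-wise product between the activation vector and the corresponding row of the final-layer weight matrix; the function name below is assumed:

```python
import torch

def concept_contributions(a: torch.Tensor, W: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Element-wise contributions a_i * W_{j,i} of every concept i to class j = class_idx."""
    return a * W[class_idx]                # shape (num_concepts,)

# Example: surface the ten most influential concepts for the predicted class.
# top_values, top_indices = torch.topk(concept_contributions(a, W, predicted_class), k=10)
```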
To remove a biased concept i, we zero its weights:
\begin{equation} W_{j,i} = 0 \quad \forall j \in \{1, 2, \ldots, \text{num\_classes}\} \end{equation}
This ensures that, regardless of the concept's activation, its contribution is zero:
\begin{equation} \text{contribution}_{j,i} = a_i \times 0 = 0 \end{equation}
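Pruning therefore amounts to zeroing one column of the final-layer weight matrix; a minimal sketch, assuming \(W\) has shape (num_classes, num_concepts):

```python
import torch

@torch.no_grad()
def prune_concept(W: torch.Tensor, concept_idx: int) -> None:
    """Zero the weight column of one concept so it cannot contribute to any class."""
    W[:, concept_idx] = 0.0
```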
We enhance concept scores based on class associations:
\begin{equation} S^{ACC}_c(x)_i = \begin{cases} E(c_i) \cdot E(x) & \text{if } E(c_i) \cdot E(x) > 0 \text{ and } M(c_i) = y \\ 0 & \text{otherwise} \end{cases} \end{equation}
where \(M(c_i)\) maps concept \(c_i\) to its associated class.
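A sketch of this filtering step, assuming `concept_to_class` is a tensor holding \(M(c_i)\) for each concept index; the names are illustrative:

```python
import torch

def acc_concept_scores(scores: torch.Tensor, concept_to_class: torch.Tensor, label: int) -> torch.Tensor:
    """Zero any score that is non-positive or whose concept is not associated with the sample's class y."""
    keep = (scores > 0) & (concept_to_class == label)   # E(c_i)·E(x) > 0 and M(c_i) = y
    return torch.where(keep, scores, torch.zeros_like(scores))
```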
Figure 4: The top panel displays the raw activation values of the top 10 concepts. The bottom panel reveals the actual contributions of these concepts to the prediction.
The CBM-GUI platform represents a significant advancement in making LLM interpretability techniques accessible to a broader audience. By providing an intuitive interface for CBL integration, concept visualization, and bias mitigation through concept pruning, the system bridges the gap between theoretical interpretability research and practical applications.
Sun, C.-E., Oikarinen, T., and Weng, T.-W. Crafting large language models for enhanced interpretability. In International Conference on Learning Representations, 2024.
Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. Concept bottleneck models. In International Conference on Machine Learning, 2020.
Yuksekgonul, M., Wang, B., and Zou, J. Post-hoc concept bottleneck models. In International Conference on Machine Learning, 2022.