ICLR 2025 Workshop
Building Trust in LLMs and LLM Applications:
From Guardrails to Explainability to Regulation

Poster Session 1 (11:15am-12:15pm)

AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks

Interpretable Steering of Large Language Models with Feature Guided Activation Additions

The Differences Between Direct Alignment Algorithms are a Blur

Towards Effective Discrimination Testing for Generative AI

Scalable Fingerprinting of Large Language Models

Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction

SPEX: Scaling Feature Interaction Explanations for LLMs

Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Differentially Private Retrieval Augmented Generation with Random Projection

LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps

Unnatural Languages Are Not Bugs but Features for LLMs

Working Memory Attack on LLMs

No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data

VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models

Analyzing Memorization in Large Language Models through the Lens of Model Attribution

Pruning as a Defense: Reducing Memorization in Large Language Models

Antipodal Pairing and Mechanistic Signals in Dense SAE Latents

Evaluating Text Humanlikeness via Self-Similarity Exponent

Self-Ablating Transformers: More Interpretability, Less Sparsity

Justified Trust in AI Fairness Assessment using Existing Metadata Entities

Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings

Fast Proxies for LLM Robustness Evaluation

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

Rethinking LLM Bias Probing Using Lessons from the Social Sciences

AI Companions Are Not The Solution To Loneliness: Design Choices And Their Drawbacks

A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

A Missing Testbed for LLM Pre-Training Membership Inference Attacks

Harmful Helper: Perform malicious tasks? Web AI agents might help

Mechanistic Anomaly Detection for "Quirky" Language Models

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Latent Adversarial Training Improves the Representation of Refusal

Reliable and Efficient Amortized Model-based Evaluation

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks

Evaluating and Mitigating the Safety Awareness-Execution Gaps of LM Agents

Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty

AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

Steering Fine-Tuning Generalization with Targeted Concept Ablation

Patterns and Mechanisms of Contrastive Activation Engineering

Hidden No More: Attacking and Defending Private Third-Party LLM Inference

MKA: Leveraging Cross-Lingual Consensus for Model Abstention

Finding Sparse Autoencoder Representations Of Errors In CoT Prompting

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering

FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering

Mind the Gap: A Practical Attack on GGUF Quantization

On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model

In-Context Meta Learning Induces Multi-Phase Circuit Emergence

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs