ICLR 2025 Workshop
Building Trust in LLMs and LLM Applications:
From Guardrails to Explainability to Regulation

Poster Session 2 (5:00pm-6:00pm)

AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

Dynaseal: A Backend-Controlled LLM API Key Distribution Scheme with Constrained Invocation Parameters

How Does Entropy Influence Modern Text-to-SQL Systems?

Language Models Use Trigonometry to Do Addition

Black-Box Adversarial Attacks on LLM-Based Code Completion

Learning Automata from Demonstrations, Examples, and Natural Language

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

Privately Learning from Graphs with Applications in Fine-tuning Large Pretrained Models

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

ToolScan: A Benchmark For Characterizing Errors In Tool-Use LLMs

Model Evaluations Need Rigorous and Transparent Human Baselines

Automated Feature Labeling with Token-Space Gradient Descent

Automated Capability Discovery via Model Self-Exploration

ExpProof: Operationalizing Explanations for Confidential Models with ZKPs

Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial Attacks

Evaluation of Large Language Models via Coupled Token Generation

Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific

Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

Why Do Multiagent Systems Fail?

Do Multilingual LLMs Think In English?

Monitoring LLM Agents for Sequentially Contextual Harm

BaxBench: Can LLMs Generate Correct and Secure Backends?

Integrated Gradients Provides Faithful Language Model Attributions for In-Context Learning

HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered

Disentangling Sequence Memorization and General Capability in Large Language Models

Unlearning Geo-Cultural Stereotypes in Multilingual LLMs

On the Role of Prompt Multiplicity in LLM Hallucination Evaluation

The Fundamental Limits of LLM Unlearning: Complexity-Theoretic Barriers and Provably Optimal Protocols

An Empirical Study on Prompt Compression for Large Language Models

Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting

Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?

Endive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention

Understanding (Un)Reliability of Steering Vectors in Language Models

Towards Understanding Distilled Reasoning Models: A Representational Approach

Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Conformal Structured Prediction

LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders

TEMPEST: Multi-Turn Jailbreaking of Large Language Models with Tree Search

The Steganographic Potentials of Language Models

Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization

StochasTok: Improving Fine-Grained Subword Understanding in LLMs

The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

ASIDE: Architectural Separation of Instructions and Data in Language Models

Measuring In-Context Computation Complexity via Hidden State Prediction