Poster Session 1 (11:15am-12:15pm)
AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
The Differences Between Direct Alignment Algorithms are a Blur
Towards Effective Discrimination Testing for Generative AI
Scalable Fingerprinting of Large Language Models
Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction
SPEX: Scaling Feature Interaction Explanations for LLMs
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Differentially Private Retrieval Augmented Generation with Random Projection
LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps
Unnatural Languages Are Not Bugs but Features for LLMs
Working Memory Attack on LLMs
No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data
VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models
Analyzing Memorization in Large Language Models through the Lens of Model Attribution
Pruning as a Defense: Reducing Memorization in Large Language Models
Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
Evaluating Text Humanlikeness via Self-Similarity Exponent
Self-Ablating Transformers: More Interpretability, Less Sparsity
Justified Trust in AI Fairness Assessment using Existing Metadata Entities
Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings
Fast Proxies for LLM Robustness Evaluation
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rethinking LLM Bias Probing Using Lessons from the Social Sciences
AI Companions Are Not The Solution To Loneliness: Design Choices And Their Drawbacks
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
A Missing Testbed for LLM Pre-Training Membership Inference Attacks
Harmful Helper: Perform malicious tasks? Web AI agents might help
Mechanistic Anomaly Detection for "Quirky" Language Models
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Latent Adversarial Training Improves the Representation of Refusal
Reliable and Efficient Amortized Model-based Evaluation
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks
Evaluating and Mitigating the Safety Awareness-Execution Gaps of LM Agents
Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
Steering Fine-Tuning Generalization with Targeted Concept Ablation
Patterns and Mechanisms of Contrastive Activation Engineering
Hidden No More: Attacking and Defending Private Third-Party LLM Inference
MKA: Leveraging Cross-Lingual Consensus for Model Abstention
Finding Sparse Autoencoder Representations Of Errors In CoT Prompting
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering
Mind the Gap: A Practical Attack on GGUF Quantization
On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality
In-Context Meta Learning Induces Multi-Phase Circuit Emergence
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs