The Problem
When a language model has been fine-tuned to believe X, but its pretraining says Y — which wins? And more importantly, can you steer that behavior without retraining?
What We Built
Cross-format data augmentation pipeline for ConflictQA and ConflictBank. Generates semantically equivalent question variants to stress-test knowledge conflict handling across paraphrase and format shifts — essential for building benchmarks that don’t overfit to surface form.
Activation steering experiments: extracted fine-tuned behavioral vectors from LoRA-trained models and applied them to frozen base models. The behavioral direction is surprisingly transferable — a steering vector from a LoRA-trained model can push a base model’s completions in the expected direction with zero weight updates.
Key Finding
Fine-tuning injects directional bias into the residual stream in a way that’s partially separable from task knowledge. If you can steer behavior without weights, you can also potentially undo fine-tuning signals. This has direct implications for alignment work.