Resources

Princeton COS 597Q: AI Safety & Alignment (Prof. Elad Hazan)

Publications

2023

Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

paper · code · website

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

paper · code · website

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

paper · code · website