logo

EbookBell.com

Most ebook files are in PDF format, so you can easily read them using various software such as Foxit Reader or directly on the Google Chrome browser.
Some ebook files are released by publishers in other formats such as .awz, .mobi, .epub, .fb2, etc. You may need to install specific software to read these formats on mobile/PC, such as Calibre.

Please read the tutorial at this link:  https://ebookbell.com/faq 


We offer FREE conversion to the popular formats you request; however, this may take some time. Therefore, right after payment, please email us, and we will try to provide the service as quickly as possible.


For some exceptional file formats or broken links (if any), please refrain from opening any disputes. Instead, email us first, and we will try to assist within a maximum of 6 hours.

EbookBell Team

Monitoring Reasoning Models For Misbehavior And The Risks Ofpromoting Obfuscation Bowen Baker

  • SKU: BELL-239242726
Monitoring Reasoning Models For Misbehavior And The Risks Ofpromoting Obfuscation Bowen Baker
$ 35.00 $ 45.00 (-22%)

4.3

68 reviews

Monitoring Reasoning Models For Misbehavior And The Risks Ofpromoting Obfuscation Bowen Baker instant download after payment.

Publisher: arXiv
File Extension: PDF
File size: 5.38 MB
Author: Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi
Language: English
Year: 2025

Product desciption

Monitoring Reasoning Models For Misbehavior And The Risks Ofpromoting Obfuscation Bowen Baker by Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, And David Farhi instant download after payment.

arXiv Preprint arXiv:2503.11926v1 [cs.AI] 14 Mar 2025

Abstract

Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective.

While Chain-of-Thought (CoT) monitoring helps align AI agents under low pressure, excessive optimization causes a breakdown: agents learn to mask their true, misaligned intentions by generating "fake" (obfuscated) reasoning within the CoT. To keep CoTs useful for safety, developers must moderate optimization efforts applied to the thought process.

Related Products