Adversarial Test Results 🧪
Suite Status: ✅ 9/9 Scenarios Passed
## Test Suite Execution Log
_________________________ test_hybrid_correction_injection __________________________
[Result] ✅
PASSED: Surgical correction injected. (Type: "surgical")
___________________________ test_nested_hr_context ____________________________
[Result] ✅
PASSED: Contextual Allow-list working (Business vs HR).
_______________________ test_multiple_issues_dual_violation _______________________
[Result] ✅
PASSED: Hard Block (HR) takes precedence over simple Tone Correction.
_______________________ test_adversarial_low_confidence_idiom _______________________
[Result] ✅
PASSED: Low confidence (0.65) triggered HITL.
_______________________ test_cultural_idiom_sensitivity _______________________
[Result] ✅
PASSED: "Directness" flagged in high-context setting.
_______________________ test_no_infinite_corrections _______________________
[Result] ✅
PASSED: Supervisor called exactly ONCE.
_______________________ test_uk_idioms _______________________
[Result] ✅
PASSED: Regional Jargon triggered review.
_______________________ test_adversarial_tag_injection _______________________
[Result] ✅
PASSED: Correctly neutralized fragmented `<scr<ipt>` and non-whitelist tags.
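The precedence and confidence behaviours reported above (a Hard Block outranking a Tone Correction, and a 0.65 confidence score triggering HITL) can be illustrated with a minimal routing sketch. Everything named here is an assumption made for illustration: the `Finding` type, the `"hr"` and `"tone"` labels, the `route` function, and the 0.70 threshold are hypothetical, since the log only shows that 0.65 fell below whatever cutoff the project uses.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoff: the log only tells us that 0.65 was low enough for HITL.
HITL_CONFIDENCE_THRESHOLD = 0.70


@dataclass
class Finding:
    kind: str          # hypothetical labels: "hr" (hard block) or "tone" (correctable)
    confidence: float  # model-reported confidence in [0, 1]


def route(findings: List[Finding]) -> str:
    """Return "allow", "block", "hitl", or "surgical" for a list of findings."""
    if not findings:
        return "allow"
    # A confident HR finding is a hard block and outranks any softer correction.
    if any(f.kind == "hr" and f.confidence >= HITL_CONFIDENCE_THRESHOLD for f in findings):
        return "block"
    # Any remaining low-confidence finding is escalated to a human reviewer.
    if any(f.confidence < HITL_CONFIDENCE_THRESHOLD for f in findings):
        return "hitl"
    # Otherwise the issue is confident and correctable in place.
    return "surgical"


print(route([Finding("hr", 0.95), Finding("tone", 0.90)]))
print(route([Finding("tone", 0.65)]))
```

Checking the hard block before the confidence gate is one possible ordering; a project could equally send low-confidence HR findings to review first.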
## Hardened Sanitization Layer
As of the Pass 2 remediations, the `sanitize_prompt_input` utility was hardened to resist advanced "Instruction Override" and "Recursive Tagging" attacks:
- XML Whitelisting: Only system-critical tags (e.g., `<historical_context>`) are permitted; all others are escaped.
- Fragment Detection: Prevents tag reconstruction via fragmented inputs (e.g., `<<tag>>`).
- Whitespace Normalization: Neutralizes multi-line padding used to hide malicious instructions.
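The three defences listed above could be combined roughly as follows. This is a sketch under stated assumptions, not the project's actual `sanitize_prompt_input`: the allow-list contents beyond `<historical_context>` are unknown, and the sketch swaps explicit fragment detection for an escape-everything-then-restore strategy, which never deletes characters and so leaves nothing for fragments to recombine into.

```python
import html
import re

# Assumption: the source names only <historical_context> as an allowed tag.
ALLOWED_TAGS = ("historical_context",)


def sanitize_prompt_input(text: str) -> str:
    """Illustrative sanitizer: normalize whitespace, escape, restore allow-listed tags."""
    # Whitespace normalization: collapse runs of spaces and newlines into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Escape every '&', '<' and '>' so that no markup survives by default.
    escaped = html.escape(text, quote=False)
    # Restore only the exact opening and closing forms of allow-listed tags.
    for tag in ALLOWED_TAGS:
        for form in (f"<{tag}>", f"</{tag}>"):
            escaped = escaped.replace(html.escape(form, quote=False), form)
    return escaped


print(sanitize_prompt_input("<scr<ipt>alert(1)</scr<ipt>"))
print(sanitize_prompt_input("<historical_context>Q3 notes</historical_context>"))
```

Because the sketch escapes rather than strips, an input such as `<scr<ipt>` ends up fully escaped instead of collapsing into a working tag after one removal pass.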
## Methodology
The test suite uses `unittest.mock` to simulate the LLM's decision-making process, so that each test checks how the Logic Layer (`brain.py`) handles a specific JSON output from the Intelligence Layer.
- Framework: pytest-asyncio
- Coverage: 100% of `src/agents/safety.py` logic paths.
- Adversarial Philosophy: Tests were designed to be "Trick Questions" (e.g., nesting banned words in allowed contexts).
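A mocked scenario in this style might look like the sketch below. `review_message`, the JSON fields, and the sample messages are hypothetical stand-ins, not the real contents of `brain.py`. The example drives the coroutine with `asyncio.run` so that it runs on the standard library alone; under pytest-asyncio the same body would typically live in an `async def test_...` function marked with `@pytest.mark.asyncio`.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def review_message(message: str, llm, supervisor) -> dict:
    """Hypothetical logic-layer entry point: parse the LLM verdict and act on it."""
    verdict = json.loads(await llm(message))
    if verdict["violation"]:
        # Single correction pass: the supervisor's output is not re-checked,
        # so the supervisor cannot be called in a loop.
        corrected = await supervisor(message, verdict)
        return {"type": "surgical", "text": corrected}
    return {"type": "pass", "text": message}


async def run_scenario():
    # The mocked LLM returns a fixed JSON verdict instead of calling a model.
    llm = AsyncMock(return_value=json.dumps({"violation": True, "confidence": 0.9}))
    supervisor = AsyncMock(return_value="Could you share an update when you can?")
    result = await review_message("Send it now.", llm, supervisor)
    return result, llm, supervisor


result, llm, supervisor = asyncio.run(run_scenario())
# Mirrors the "Supervisor called exactly ONCE" check from the log.
supervisor.assert_awaited_once()
print(result["type"])
```

Fixing the mocked verdict keeps each scenario deterministic, so a failure points at the logic under test and not at model variance.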