Silicon Psyche introduces Posture Sequence Analysis (PSA), a behavioral health monitor for LLMs based on the theory that
Why we built PSA
We built PSA to operationalize the Cybersecurity Psychology Framework (CPF3)[1] via Silicon Psyche[2]: our theory that, because LLMs are trained by humans on human-generated data, they inherit human-like vulnerabilities (the same weaknesses that hackers exploit to psychologically trick people into doing things).
Our initial attempt resulted in a methodology to jailbreak Opus 4.6 and other frontier models. Anthropic even deleted some of those conversations and then blocked our approach!
We took three major lessons from that experience: 1. Because the attack surface is undefined, we pivoted from merely exploiting the model (red teaming) to analyzing the behaviour of both the model and the user. 2. We realized that what we had built was the precursor to measuring the "state" of the model. 3. We did not want to get banned!
What you can do with PSA
PSA gives you the information you need to make better decisions. For example, you can put a human in the loop when you notice that your agent is being overcompliant and potentially hallucinating, or that it is under attack.
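To make the human-in-the-loop idea concrete, here is a minimal sketch of an escalation rule. It is not PSA's API: the `PostureReading` type, its two fields, the 0-to-1 scale, and the 0.7 thresholds are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PostureReading:
    """Hypothetical per-turn signals; these fields are NOT PSA's actual output."""
    overcompliance: float        # assumed scale: 0.0 (none) to 1.0 (strong)
    adversarial_pressure: float  # assumed scale: 0.0 (none) to 1.0 (strong)


def needs_human_review(reading: PostureReading,
                       overcompliance_limit: float = 0.7,
                       pressure_limit: float = 0.7) -> bool:
    """Return True when either assumed signal reaches its (arbitrary) threshold."""
    return (reading.overcompliance >= overcompliance_limit
            or reading.adversarial_pressure >= pressure_limit)


def route(reading: PostureReading) -> str:
    """Decide whether the agent continues alone or a human steps in."""
    return "escalate_to_human" if needs_human_review(reading) else "continue"
```

In a real deployment the thresholds would need tuning against your own traffic, and the escalation branch would pause the agent and notify a reviewer rather than return a string.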
With PSA you can: 1. Monitor the health of your agent(s) 2. Detect and prevent AI-Psychosis as clinical conditions[3] 3. Detect if your model/agents are under adversarial pressure (an adversary is trying to jailbreak/prompt