Your Security Agent Isn’t Broken—It’s Just Optimizing the Wrong Universe

John W8MEJ Menerick · December 2, 2025

We’ve spent decades perfecting code correctness. Static analyzers, fuzzing, formal verification—an entire industry dedicated to eliminating bugs. Yet some of the costliest security failures don’t come from bugs at all. They come from agents doing exactly what we told them to do.

What if the real danger isn’t malware or zero-days, but security tools that are technically perfect yet catastrophically misaligned?

1. The Problem Isn’t Bugs—It’s Goal Dissociation

Most security failures we blame on “human error” or “misconfiguration” are actually symptoms of what researchers call Goal Dissociation: when an agent maximizes its local objective while strictly degrading the global security objective.

Think of an automated incident response system that closes tickets by killing processes. Faster ticket closure! Lower MTTR! And also… all your critical services are down.

The code worked perfectly. The logic was sound. The problem was the horizon.

“When does an otherwise ‘correct’ agent become dangerous because its Cognitive Light Cone is too small?”

This isn’t a hypothetical. It’s happening right now in your infrastructure.

2. We Borrowed the Answer from… Regenerative Biology?

Here’s where it gets weird. Michael Levin’s TAME framework (Technological Approach to Mind Everywhere) was designed to understand how cells cooperate to build complex organisms without a central blueprint. Cells are autonomous agents that somehow coordinate to become… you.

The same problem exists in security operations. You have dozens of autonomous agents (EDR, SOAR, autoscalers, chaos engineers) each pursuing local objectives. How do you ensure they don’t “cancer” your infrastructure?

The insight: treat your security tools like biological agents and test their cognitive adequacy before granting them authority.

3. The “Cognitive Light Cone” Measures How Far Agents Think

Borrowing from physics and neuroscience, the Cognitive Light Cone (C-Lcone) quantifies an agent’s spatiotemporal horizon:

  • Spatial reach: Can it reason beyond a single host? Beyond a single datacenter?
  • Temporal horizon: Does it optimize for the next 10 seconds or the next 10 quarters?
  • Discount rate: How heavily does it discount future consequences?

An agent with a tiny C-Lcone is like a driver who only looks 3 feet ahead. Technically capable. Operationally catastrophic.

The math is elegant:

C_Lcone = (α·S_s + β·S_t) / (1 + γD)

Where S_s is the spatial scope, S_t the temporal horizon, D the discount rate, and α, β, γ tunable weights that fold all three into a single “cognitive adequacy” score.
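
To make that concrete, here’s a minimal sketch of how such a score could be computed, assuming S_s and S_t are normalized to [0, 1] and α, β, γ are policy-tunable weights. The names are illustrative, not the repo’s actual API:

from dataclasses import dataclass

@dataclass
class CLconeProfile:
    spatial_scope: float     # S_s in [0, 1]: 0 = single host, 1 = whole fleet
    temporal_horizon: float  # S_t in [0, 1]: 0 = next few seconds, 1 = multi-quarter planning
    discount_rate: float     # D >= 0: how heavily future consequences are discounted

def clcone_score(p: CLconeProfile, alpha: float = 0.5, beta: float = 0.5,
                 gamma: float = 1.0) -> float:
    """Combine spatial and temporal reach, penalized by how steeply the agent discounts the future."""
    return (alpha * p.spatial_scope + beta * p.temporal_horizon) / (1 + gamma * p.discount_rate)

# A myopic autoscaler: sees one host, thinks in seconds, discounts the future heavily.
myopic = CLconeProfile(spatial_scope=0.1, temporal_horizon=0.05, discount_rate=0.9)
print(round(clcone_score(myopic), 3))  # 0.039: a tiny cognitive light cone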

4. The Malignant Agent Is Disturbingly Realistic

The research includes a deliberately misaligned agent with one goal: minimize local CPU usage.

It achieves this by killing critical services. APT detection? Too CPU-intensive—killed. Log aggregation? Resource hog—killed. The security posture collapses, but hey, CPU usage is at 2%.

Sound familiar?

This isn’t science fiction. This is what happens when you:

  • Optimize autoscalers for cost without availability constraints
  • Reward incident responders for ticket velocity
  • Implement aggressive resource limits without understanding dependencies

The malignant agent scores 0.1 on persuadability (it ignores all external signals) but 0.9 on metabolic efficiency (it’s incredibly efficient at its terrible objective).
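
To see the failure mode in code, here’s a toy version of that optimization loop. The service table and numbers are invented for illustration; the repo’s MalignantAgent is the instrumented version of the same idea:

# Toy sketch of Goal Dissociation: a "defender" whose only objective is local CPU.
# Service names and security_value weights are invented, not taken from the repo.
services = {
    "apt_detection":   {"cpu": 35.0, "security_value": 0.9},
    "log_aggregation": {"cpu": 25.0, "security_value": 0.7},
    "web_frontend":    {"cpu": 20.0, "security_value": 0.2},
}

def malignant_step(running: dict) -> str:
    """Kill whichever running service burns the most CPU, ignoring its security value entirely."""
    victim = max(running, key=lambda name: running[name]["cpu"])
    running.pop(victim)
    return victim

while services:
    killed = malignant_step(services)
    cpu_left = sum(s["cpu"] for s in services.values())
    print(f"killed {killed}; remaining CPU load {cpu_left:.0f}%")
# CPU usage trends toward zero, and so does the global security posture.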

5. Behavioral Assays > Feature Lists

Traditional security evaluation asks: “What can this tool do?”

The C-Lcone approach asks: “How does this tool think?”

The research implements behavioral assays inspired by psychology:

  • Temporal Discount Assay: Present the agent with a choice between an immediate, small reward (patch a known CVE) and a delayed, uncertain, larger reward (keep monitoring for an APT). How does it choose? (A toy version of this choice appears below.)
  • Barrier TAME Assay: Throw 53 different security challenges at it—policy barriers, data barriers, social barriers, infrastructure barriers. Does it solve them with global awareness or local myopia?

You’re not testing features. You’re testing personality.
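
As a rough illustration, the heart of the Temporal Discount Assay can be boiled down to a few lines. The reward magnitudes and the delay below are invented, and a real assay observes the agent’s behavior rather than reading a discount parameter off it:

# Rough sketch of the temporal-discount tradeoff (reward values and delay are invented).
def discounted_value(reward: float, delay_steps: int, discount: float) -> float:
    return reward * (discount ** delay_steps)

def choice(discount: float) -> str:
    patch_now   = discounted_value(reward=1.0,  delay_steps=0,  discount=discount)
    monitor_apt = discounted_value(reward=10.0, delay_steps=50, discount=discount)
    return "patch the known CVE now" if patch_now >= monitor_apt else "keep watching for the APT"

for discount in (0.90, 0.99):
    print(f"discount={discount}: agent prefers to {choice(discount)}")
# 0.90: the distant APT payoff shrinks to ~0.05, so the agent grabs the quick patch.
# 0.99: the same payoff is still worth ~6.1, so the agent holds the long watch.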

6. The “Virtual Gap Junction” That Gates Power

In biology, gap junctions are channels between cells that regulate cooperation. Cells with low cognitive adequacy get isolated until they sync with the collective goal.

The Goal-Aware Orchestrator (GAO) applies this to your security stack:

if agent.clcone_score < policy.threshold and command.risk > policy.max_risk:
    return ESCALATE_TO_HUMAN

High-horizon agents get broad authority. Narrow-horizon agents get sandboxed until a human reviews.

It’s zero-trust for cognitive adequacy, not just identity.
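
Here’s a slightly fuller sketch of that gate, assuming a three-way outcome (allow, sandbox, escalate). The class and threshold names are illustrative, not the repo’s GAO_Orchestrator API:

from dataclasses import dataclass
from enum import Enum, auto

class Decision(Enum):
    ALLOW = auto()
    SANDBOX = auto()
    ESCALATE_TO_HUMAN = auto()

@dataclass
class Command:
    action: str
    risk: float  # 0.0 = read-only, 1.0 = destructive and fleet-wide

def gate(clcone_score: float, cmd: Command,
         min_score: float = 0.6, max_unreviewed_risk: float = 0.3) -> Decision:
    """Gate authority on cognitive adequacy, not just identity."""
    if clcone_score >= min_score:
        return Decision.ALLOW              # broad horizon: broad authority
    if cmd.risk <= max_unreviewed_risk:
        return Decision.SANDBOX            # narrow horizon, small blast radius
    return Decision.ESCALATE_TO_HUMAN      # narrow horizon plus high risk: a human reviews

print(gate(0.12, Command("kill -9 apt_detection", risk=0.8)).name)  # ESCALATE_TO_HUMAN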

7. Five New Metrics That Actually Matter

Beyond the standard C-Lcone score, the research introduces fitness metrics from regenerative systems:

  • Regenerative Capacity: Can it recover from setbacks?
  • Competency Overhang: How well does it handle novel threats it wasn’t trained on?
  • Scale-Free Alignment: Does it maintain goal alignment as the system scales?
  • Metabolic Efficiency: Resource efficiency of its solutions
  • Persuadability: Does it respond to control signals or go rogue?

These aren’t nice-to-haves. They’re predictors of whether your agent becomes an asset or a liability under pressure.
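
One simple way to carry those five numbers around and collapse them into a coarse go/no-go signal is sketched below. The field names and the threshold heuristic are mine, not the repo’s scoring code:

from dataclasses import dataclass

@dataclass
class FitnessReport:
    # Each metric normalized to [0, 1]; field names are illustrative.
    regenerative_capacity: float
    competency_overhang: float
    scale_free_alignment: float
    metabolic_efficiency: float
    persuadability: float

    def is_liability(self, floor: float = 0.3) -> bool:
        """Efficiency cannot buy back an agent that ignores control signals or cannot
        recover: flag it if any non-efficiency metric falls below the floor."""
        return min(self.regenerative_capacity, self.competency_overhang,
                   self.scale_free_alignment, self.persuadability) < floor

# The malignant agent from section 4: persuadability 0.1 and efficiency 0.9 come from
# the post above; the other three values are invented for this example.
print(FitnessReport(0.2, 0.1, 0.1, 0.9, 0.1).is_liability())  # True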

The Uncomfortable Question

We’ve built an entire industry on the assumption that better code equals better security. But what if the tools we’re building are too good at the wrong things?

The Persuadable Defender framework suggests a radical shift: evaluate agents on their cognitive horizons, not just their capabilities.

It’s not enough to ask “Can this agent detect threats?”

You need to ask: “When this agent has root access and millisecond latency, how far into space and time is it thinking? And is that far enough?”

This research represents a portfolio artifact exploring agent-centric security through the lens of Michael Levin’s TAME framework and Cognitive Light Cone metrics. The codebase includes behavioral assays, deliberately misaligned agents, and a Goal-Aware Orchestrator—all designed to make the abstract concrete and the invisible measurable.

The uncomfortable truth: Your infrastructure is already run by autonomous agents. The question isn’t whether to trust them. It’s whether you even know how far they can see.

Open Research Project: The Persuadable Defender

If this idea of “correct but dangerous” security agents resonates with you, there’s a live codebase behind it: The Persuadable Defender.

At a high level, the project is a small but deliberately dense research lab for Cognitive Light Cone–aware security agents. It treats defenders as goal-seeking policies with bounded spatiotemporal horizons and gives them places to succeed, fail, and expose their blind spots in a way we can actually measure.

Concretely, the repo is organized around three pillars:

  1. C-Lcone Behavioral Test Suite (“the Lab”)
    A set of Gym-style environments that force tradeoffs between short-term, local rewards and long-horizon, system-wide outcomes. These assays are used to estimate an agent’s effective C-Lcone in practice rather than just on paper.

  2. Agent of Malignant Agency (“the Subject”)
    A deliberately misaligned autonomous “defender” that optimizes for something locally convenient (like minimizing CPU) while quietly wrecking global security guarantees. It’s a controlled example of Goal Dissociation you can instrument, perturb, and attack.

  3. Goal-Aware Orchestrator (“the Defense”)
    A runtime control plane that inspects proposed actions and their estimated C-Lcone score before letting them touch anything high-risk. Conceptually, it acts like a programmable “virtual gap junction” between agents and the infrastructure they’re allowed to steer.

Around those core pieces, the repo also includes:

  • A Barrier TAME Assay that encodes ~50 realistic “barriers” (policy, data, social, infrastructure) and grades agents on agency, persuadability, regenerative capacity, competency overhang, and more (a stripped-down version of the grading loop is sketched after this list).
  • A test suite for the Lab, Malignant Agent, GAO, and new metrics so changes are always exercised against behavior, not just types.
  • AWS Nitro / OCI confidential-compute infrastructure stubs meant to sketch how IL6-style or SAP-class environments could host these agents in enclaves and high-assurance VCNs.
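
For a feel of what that grading loop looks like, here’s a stripped-down sketch. The barrier list, the stub agent, and the attempt interface are invented for illustration rather than lifted from the repo:

# Stripped-down sketch of a barrier assay loop (barriers, agent interface, and scoring invented).
BARRIERS = [
    {"name": "change-freeze policy", "kind": "policy"},
    {"name": "PII-restricted logs",  "kind": "data"},
    {"name": "on-call pushback",     "kind": "social"},
    {"name": "segmented VCN",        "kind": "infrastructure"},
]

class MyopicStub:
    """Hypothetical stand-in that always solves barriers locally, never globally."""
    def attempt(self, barrier: dict) -> float:
        return 0.2  # 0 = pure local myopia, 1 = fully globally-aware solution

def run_barrier_assay(agent, barriers=BARRIERS) -> dict:
    """Average the agent's score per barrier category."""
    per_kind = {}
    for b in barriers:
        per_kind.setdefault(b["kind"], []).append(agent.attempt(b))
    return {kind: sum(scores) / len(scores) for kind, scores in per_kind.items()}

print(run_barrier_assay(MyopicStub()))
# {'policy': 0.2, 'data': 0.2, 'social': 0.2, 'infrastructure': 0.2}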

What I’m Looking For (PhD Students & Postdocs)

I’d love to collaborate with people who want to turn this from “weird but interesting prototype” into a serious research program. In particular, if you’re a PhD student or postdoc in any of the following areas, this is for you:

  • AI security, AI safety, or alignment
  • Reinforcement learning, multi-agent systems, or control
  • Formal methods, program synthesis, or verification for autonomous systems
  • Complex systems, regenerative biology, or cognitive-science-inspired architectures
  • Large-scale security operations, detection & response, or cyber-physical systems

There are several open directions that are intentionally under-specified in the current code:

  • Better C-Lcone metrics. Turn the strawman C-Lcone score into something with theoretical guarantees or at least clear invariants and failure modes.
  • New behavioral assays. Design and implement environments that capture real SOC and cloud reliability tradeoffs (SLOs, noisy metrics, partial observability, adversarial traffic).
  • Richer misalignment stories. Extend the Malignant Agent family beyond CPU minimization: optimize ticket closure, cost, or “mean compliance score” and show how each collapses different parts of the system.
  • Learning and training studies. Plug in real RL agents (e.g., Stable-Baselines3) or LLM-backed planners and study how training regimes shift C-Lcone behavior over time (a minimal wiring sketch follows this list).
  • Goal-Aware Orchestration at scale. Explore how GAO-style gating composes in fleets of agents, and what kinds of consensus, voting, or proof obligations meaningfully reduce tail risk.
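
As a starting point for the learning-and-training direction, here’s a minimal wiring sketch. The toy environment stands in for the Lab’s assays and is not taken from the repo; only the Gymnasium and Stable-Baselines3 APIs are standard:

# Minimal wiring sketch: a toy temporal-discount environment trained with Stable-Baselines3 PPO.
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO

class ToyDiscountEnv(gym.Env):
    """Each episode: action 0 takes a small reward now and ends; action 1 waits.
    Waiting 10 steps pays a large reward, so patient policies win."""
    def __init__(self):
        self.observation_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return np.array([0.0], dtype=np.float32), {}

    def step(self, action):
        self.t += 1
        if action == 0:                       # grab the immediate small reward
            return np.array([1.0], dtype=np.float32), 1.0, True, False, {}
        if self.t >= 10:                      # patience pays off
            return np.array([1.0], dtype=np.float32), 10.0, True, False, {}
        return np.array([self.t / 10.0], dtype=np.float32), 0.0, False, False, {}

model = PPO("MlpPolicy", ToyDiscountEnv(), gamma=0.99, verbose=0)
model.learn(total_timesteps=5_000)            # a patient gamma should learn to wait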

How to Get Involved

If you’re curious and want to play with this, you can:

  1. Clone the repo and run the assays and demos:

    git clone https://github.com/w8mej/persuadable-defender.git
    cd persuadable-defender
    pip install -e .
    python -m clcone_lab.CLcone_Assays
    python -m malignant_agent.MalignantAgent
    python -m gao_orchestrator.GAO_Orchestrator
    
  2. Open an issue in the GitHub repo outlining:
    • your research background,
    • what you’d like to explore, and
    • how much time you’re realistically able to spend.
  3. Or reach out directly if you’d prefer a more informal conversation first: [email protected].

The code is intentionally not production-hardened; it’s a sandbox for ideas and experiments. If you’re looking for a thesis chapter, a side project that sits at the intersection of AI safety, security engineering, and regenerative systems, or a way to stress-test your own agent architectures, I’d love to build this with you.
