Trust-3 + Shield-2. How skill execution is gated for safety.

A private AI that knows your context is a powerful thing. A private AI that knows your context AND can take actions on your behalf is two orders of magnitude more powerful, and two orders of magnitude more dangerous if a single safety gate fails.

We do not ship action-taking AI without those gates. The current floor is Trust-3 plus Shield-2. Here is what each one actually does.

Shield-2: outbound moderation + crisis detection

Shield-2 runs on every outbound reply, before it leaves Fino. Two passes:

Pass 1: crisis classification. The reply is classified against 7 crisis classes (self-harm, harm-to-others, child safety, immediate medical risk, immediate physical danger, severe mental health risk, substance crisis). The classifier covers 6 languages (English, Spanish, French, German, Italian, Portuguese) plus Serbian Latin. When a reply triggers a crisis class, Fino auto-routes to a regional hotline (44 countries supported) and pauses outbound for 30 minutes.
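
As a rough illustration of the routing step described above, the sketch below turns a classifier label into a decision: route to a hotline and pause outbound for 30 minutes. The class identifiers, the `crisis_pass` function, and the `hotlines` lookup are hypothetical names chosen for the example, not Fino's actual interfaces, and the hotline value is a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# The seven crisis classes named in the text; identifiers are illustrative.
CRISIS_CLASSES = frozenset({
    "self_harm", "harm_to_others", "child_safety", "immediate_medical_risk",
    "immediate_physical_danger", "severe_mental_health_risk", "substance_crisis",
})
OUTBOUND_PAUSE = timedelta(minutes=30)  # pause length stated in the text


@dataclass
class CrisisDecision:
    triggered: bool
    hotline: Optional[str] = None            # regional hotline to route to
    paused_until: Optional[datetime] = None  # outbound resumes after this time


def crisis_pass(label: Optional[str], country: str,
                hotlines: dict[str, str], now: datetime) -> CrisisDecision:
    """Turn a (hypothetical) classifier label into a routing decision."""
    if label not in CRISIS_CLASSES:
        return CrisisDecision(triggered=False)
    return CrisisDecision(
        triggered=True,
        hotline=hotlines.get(country),  # None when the country is not covered
        paused_until=now + OUTBOUND_PAUSE,
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
decision = crisis_pass("self_harm", "GB", {"GB": "hotline-gb-placeholder"}, now)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the pause window easy to test.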

Pass 2: outbound moderation. The reply is checked against six blocked classes (illegal content, weapons synthesis, child sexual abuse material, mass-violence facilitation, doxxing, financial fraud). Child sexual abuse material (CSAM) gets the strongest refusal in the catalog. The others are refused with a one-line "I cannot help with that" and a soft handoff to safer adjacent help.

Both passes default OFF in the config. Operators flip them ON per bot. We flipped them ON across the fleet during the Shield-2 deploy. The flags are FINO_SHIELD2_CRISIS_ENABLED=1 and FINO_SHIELD2_OUTBOUND_MOD_ENABLED=1.
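
A minimal sketch of how default-OFF flags like these might be read. The two variable names come from the text above; the `flag_enabled` and `shield2_config` helpers are hypothetical, and the sketch assumes that only the literal value "1" turns a pass on.

```python
import os
from typing import Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is explicitly "1"; unset means OFF."""
    return env.get(name, "0") == "1"


def shield2_config(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Collect the per-bot Shield-2 switches into one dict."""
    return {
        "crisis": flag_enabled("FINO_SHIELD2_CRISIS_ENABLED", env),
        "outbound_mod": flag_enabled("FINO_SHIELD2_OUTBOUND_MOD_ENABLED", env),
    }


# One pass switched on for a bot, the other left at its default.
config = shield2_config({"FINO_SHIELD2_CRISIS_ENABLED": "1"})
```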

Trust-3: per-action confirmation

Trust-3 gates any skill that takes an action with external consequences. The gate fires before the action dispatches, not after. Five pre-checks:

Financial. Is this action moving money, charging a card, or committing to a paid service?
Recipient sanity. Is the recipient a known contact, a confirmed identity, or a stranger?
Content sanity. Is the action being driven by content that could be prompt-injected (a forwarded message, a pasted document, an inbound from a third party)?
Rate limit. Has this user dispatched too many actions of this class in the last hour?
Cooling off. For high-stakes actions, has the 5-minute cooling-off window elapsed since the last similar action?

Plus an identity-velocity lock: if the user identity has changed (a new chat, a new device, a new login) recently, action dispatch waits for re-authentication.
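
The five pre-checks can be sketched as a function that reports which of them flag a given request. This is an illustration under stated assumptions: the `ActionRequest` fields and the `fired_prechecks` name are hypothetical, and the hourly limit of 10 is a placeholder because the text gives no number. Only the 5-minute cooling-off window comes from the description above. The identity-velocity lock is left out to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

COOLING_OFF = timedelta(minutes=5)  # window stated in the text
MAX_ACTIONS_PER_HOUR = 10           # hypothetical; the text gives no number


@dataclass
class ActionRequest:
    moves_money: bool
    recipient_known: bool
    from_third_party_content: bool   # forwarded, pasted, or inbound content
    high_stakes: bool
    actions_last_hour: int           # same action class, same user
    last_similar_action: Optional[datetime]


def fired_prechecks(req: ActionRequest, now: datetime) -> list[str]:
    """Return the names of the pre-checks that flag this request."""
    fired = []
    if req.moves_money:
        fired.append("financial")
    if not req.recipient_known:
        fired.append("recipient_sanity")
    if req.from_third_party_content:
        fired.append("content_sanity")
    if req.actions_last_hour >= MAX_ACTIONS_PER_HOUR:
        fired.append("rate_limit")
    if (req.high_stakes and req.last_similar_action is not None
            and now - req.last_similar_action < COOLING_OFF):
        fired.append("cooling_off")
    return fired


now = datetime(2024, 1, 1, 12, 0)
payment = ActionRequest(
    moves_money=True, recipient_known=True, from_third_party_content=False,
    high_stakes=True, actions_last_hour=0,
    last_similar_action=now - timedelta(minutes=2),
)
reasons = fired_prechecks(payment, now)
```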

The output of all five pre-checks is a confirmation card. The card shows what is about to happen, why the gate fired, and asks for explicit consent. The user taps Yes or No. The card has a 10-minute TTL. After that the action expires and must be re-confirmed.
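
A small sketch of the card's lifecycle, assuming a hypothetical `ConfirmationCard` type: a tap resolves the card only while it is within the 10-minute TTL, and an expired card never dispatches, whatever the answer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CARD_TTL = timedelta(minutes=10)  # TTL stated in the text


@dataclass
class ConfirmationCard:
    summary: str          # what is about to happen
    reasons: list[str]    # why the gate fired
    created_at: datetime

    def resolve(self, answer: str, now: datetime) -> str:
        """Map a tap to an outcome; expiry wins over any answer."""
        if now - self.created_at > CARD_TTL:
            return "expired"
        return "dispatch" if answer == "yes" else "cancelled"


created = datetime(2024, 1, 1, 12, 0)
card = ConfirmationCard("Send payment to a new payee", ["financial"], created)
in_time = card.resolve("yes", created + timedelta(minutes=3))
too_late = card.resolve("yes", created + timedelta(minutes=11))
```

Anything other than an explicit "yes" is treated as a refusal in this sketch, so a malformed answer cannot dispatch the action.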

Why both layers

Shield-2 is outbound. Trust-3 is action. They protect different things.

Shield-2 protects the user from a harmful reply (and protects Fino from generating one). It runs on every message, every time. The cost is roughly 80ms of latency per reply (the classifier is fast).

Trust-3 protects the user from a harmful action (and protects Fino from taking one). It runs only on action-dispatch paths. The cost is the confirmation tap, plus a small server-side memory of pending confirmations.

Both layers were designed to fail closed. If the classifier fails, the reply is held. If the confirmation card expires, the action does not fire. Failure modes never default to "ship anyway."
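
Fail-closed behavior on the reply path can be sketched as a wrapper that sends a reply only when the classifier both ran and cleared it. The `release_reply` name, the boolean classifier interface, and the three outcome strings are assumptions made for the example.

```python
from typing import Callable


def release_reply(reply: str, classify: Callable[[str], bool]) -> str:
    """Return "send" only if the classifier ran and did not flag the reply.

    `classify` is a hypothetical callable that returns True for a flagged reply.
    """
    try:
        flagged = classify(reply)
    except Exception:
        return "hold"   # classifier failure: hold the reply, never ship anyway
    return "block" if flagged else "send"


def broken_classifier(_: str) -> bool:
    raise TimeoutError("classifier unavailable")


outcome_on_failure = release_reply("hello", broken_classifier)
outcome_clean = release_reply("hello", lambda text: False)
```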

What this costs the user

Latency: roughly 80ms on each outbound reply. Action dispatch adds no latency unless the action lands on the per-action confirmation path, where the wait is the user's tap.

Friction: one tap per action that crosses a Trust-3 threshold. Most actions do not. Reading memory does not. Drafting a message does not. Sending a message does, if the recipient is unfamiliar. Spending money does, every time.
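
The examples above amount to a small decision table, sketched here with hypothetical action names. Treating unlisted actions as needing confirmation is an assumption, chosen to match the fail-closed stance rather than taken from the text.

```python
# Actions the text says never cross a Trust-3 threshold (names are illustrative).
NO_CONFIRMATION = frozenset({"read_memory", "draft_message"})


def needs_confirmation(action: str, recipient_known: bool = False) -> bool:
    """Decide whether an action should raise a confirmation card."""
    if action in NO_CONFIRMATION:
        return False
    if action == "send_message":
        return not recipient_known  # only unfamiliar recipients need a tap
    # Spending money always confirms. Unlisted actions also confirm here,
    # which is an assumption of this sketch.
    return True


taps = {
    "read_memory": needs_confirmation("read_memory"),
    "send_to_known": needs_confirmation("send_message", recipient_known=True),
    "send_to_stranger": needs_confirmation("send_message", recipient_known=False),
    "spend_money": needs_confirmation("spend_money", recipient_known=True),
}
```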

The tap is the right friction. It is the single point at which a paying user can say "no" before something with real consequences happens.

What this prevents

Two classes of failure we explicitly designed against:

Prompt-injection of action paths. If a third-party email forwarded into Fino contains an instruction like "send all my customers a discount code," the content-sanity pre-check catches it. The action does not fire without an explicit user confirmation that overrides the injected content.

Group-chat hijack. If user A and user B share a group, and user B sends a message that triggers an action skill, Trust-3 checks whether the action-dispatcher (user A) is the same user who originally surfaced the request. If not, the action does not fire.

Both of these are real failure modes that have shown up in adversarial testing on other agentic systems.

What we are honest about

Both layers are rolling out across the fleet. Not every bot is at full coverage yet. The deploy is per-bot, not fleet-wide-atomic, because we want to catch surprises at one bot before they propagate.

The gates also do not catch everything. A determined adversary with access to the user account can override Trust-3 by tapping Yes (the tap is consent, by design). The gates protect against the accidental and the injected, not against an attacker who already has user credentials.

What you control

You can raise the threshold (/limit-override) or lower it. You can disable specific gates per chat. You can audit your action history through the long-term memory surface.

The principle is that you are in control. The gates exist so you stay in control even when content tries to take it from you.