AI Strategy

When AI Stops ‘Answering’ and Starts Operating Your Computer

9 min read · Published February 24, 2026 · Updated February 24, 2026

By CogLab Editorial Team · Reviewed by Knyckolas Sutherland

Yesterday, Standard Intelligence published a post with a deceptively simple claim: they trained a foundation model for computer use that learns directly from video, not from a pile of human-labeled screenshots. If you’ve been watching the last year of ‘computer use agents’ demos—models clicking buttons, filling forms, fumbling around web apps—this is the part where the genre is supposed to get serious. Not because the demos are flashier, but because the training recipe is aiming at scale.

Their model, FDM-1, is built around an idea that’s easy to say and hard to execute: treat the raw screen recording of human computer activity as the training data, infer the actions (mouse movements, key presses) automatically, and then train a model with enough context window to understand work that unfolds over minutes, not seconds. Standard Intelligence says FDM-1 was trained on videos from a portion of an 11-million-hour screen recording dataset, and that their video encoder can compress almost two hours of 30 FPS video into about one million tokens. The point isn’t the number as a flex; it’s what the number is trying to buy: continuity. If your model can’t keep the thread of what happened 30 seconds ago, it can’t be a competent coworker in a real tool.
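A quick back-of-the-envelope check makes the compression claim concrete. FDM-1's tokenizer details aren't public; this is just arithmetic on the numbers in the post, treating 'almost two hours' as exactly two:

```python
# Back-of-envelope on the stated figures: ~2 hours of 30 FPS video
# compressed into ~1 million tokens. Encoder details are not public;
# this only shows what those numbers imply per frame.
fps = 30
hours = 2
token_budget = 1_000_000

frames = fps * 60 * 60 * hours           # 216,000 frames
tokens_per_frame = token_budget / frames
print(f"{frames:,} frames -> {tokens_per_frame:.1f} tokens/frame")
# 216,000 frames -> 4.6 tokens/frame
```

For comparison, a single screenshot passed through a typical vision-language encoder often costs hundreds of tokens, which is why per-frame budgets this small are what make minutes-long context feasible at all.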

The critique they’re making—implicitly, and sometimes explicitly—is that the dominant approach to ‘agents that use computers’ has been too screenshot-shaped. Take a vision-language model, fine-tune it on labeled screenshots of actions, then build reinforcement learning environments for the downstream task you care about. That pipeline can produce impressive point solutions, but it’s brittle for anything that looks like actual knowledge work: long-horizon workflows, continuous cursor movements, high-frame-rate interaction, and the messy reality that the right next step often depends on something you did five minutes ago.

FDM-1’s training story is also an argument about labeling. If you require humans to annotate every click and keystroke, your dataset stays small because the economics force it to. So they lean on a technique from imitation learning: train an inverse dynamics model (IDM) that looks at ‘before’ and ‘after’ states and predicts the action that likely happened in between. Their claim is that for many screen recordings, the action is inferable directly from the pixels—if a ‘K’ appears, a ‘K’ key was pressed; if a menu opens under the cursor, a click happened there—and that with enough history you can resolve harder cases. In other words: use a model to create labels at scale, so you can train the main model at scale.
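To make the two-stage idea concrete, here is a deliberately toy sketch of IDM-style pseudo-labeling, with strings standing in for frame sequences. Nothing here reflects FDM-1's actual code; the names and the 'typed character' heuristic are illustrative only:

```python
def infer_action(before, after):
    """Toy inverse dynamics model: if exactly one character was
    appended between 'frames', the inferred action is typing it."""
    if after.startswith(before) and len(after) == len(before) + 1:
        return ("type", after[-1])
    return ("unknown", None)

# Stage 2: run the IDM over a larger unlabeled corpus; the resulting
# (observation, inferred action) pairs are what train the main model.
unlabeled = [("cat", "cats"), ("he", "hel"), ("abc", "xyz")]
pseudo_labels = [infer_action(b, a) for b, a in unlabeled]
print(pseudo_labels)
# [('type', 's'), ('type', 'l'), ('unknown', None)]
```

The real version replaces the heuristic with a trained model over pixels and history, but the economics are the same: label a small set by hand, learn the labeler, then let the labeler scale to the whole corpus.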

This is not a brand-new concept, and it’s worth naming the lineage because it tells you what’s real here and what’s still aspirational. OpenAI’s Video PreTraining (VPT) work in 2022 showed a version of this loop in a constrained domain: train an IDM with a small amount of labeled data, use it to label a much larger set of unlabeled videos, then train a behavioral prior—in their case from Minecraft videos, using the native mouse/keyboard interface. And earlier work like Behavioral Cloning from Observation (BCO) formalized the two-phase idea: first learn an inverse model from self-supervised interaction, then learn to imitate an expert from observations alone.

Standard Intelligence is effectively trying to run that playbook on ‘the computer’ as a domain, and that’s a meaningful escalation. The computer isn’t one environment. It’s a shifting landscape of interfaces, UI frameworks, scrolling, popups, and tiny micro-interactions you only learn by doing. If you can get an internet-scale video corpus of people operating software, you can try to learn the shared grammar that underlies all of it: how humans search, compare, undo, recover, and complete work through interfaces.

Their post includes demos in three directions that hint at what they think the model is good at. One is CAD-like work, with continuous mouse movements in Blender. Another is automated UI testing and ‘fuzzing,’ where the goal isn’t to complete a single happy-path workflow but to explore a state space and find weird edge cases—like a banking app allowing a ‘Submit Wire Transfer’ button to be clickable right after a transfer completes, driving a balance negative. The third demo is deliberately provocative: after fine-tuning on less than an hour of collected data, they show key-press control in a real car via a ‘joystick mode’ setup.
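The fuzzing demo is easiest to picture as an invariant check over an explored state space. This toy sketch (my construction, not theirs) plants the wire-transfer bug from the post in a fake app and finds it by exhaustively trying short click sequences; a real agent would explore a vastly larger space, guided by the model instead of brute force:

```python
from itertools import product

ACTIONS = ["submit_wire", "cancel", "refresh"]

class BankApp:
    """Fake app with the article's bug: 'Submit Wire Transfer' stays
    clickable after the transfer completes."""
    def __init__(self):
        self.balance = 100

    def click(self, button):
        if button == "submit_wire":   # bug: never disabled after use
            self.balance -= 80

def find_violation(depth=2):
    """Try every click sequence up to `depth` and return the first
    one that breaks the invariant balance >= 0."""
    for seq in product(ACTIONS, repeat=depth):
        app = BankApp()
        for button in seq:
            app.click(button)
        if app.balance < 0:
            return seq
    return None

print(find_violation())
# ('submit_wire', 'submit_wire')
```

The interesting design choice is that the goal is an invariant ('balance never goes negative'), not a workflow, which is exactly what makes exploratory testing hard to script by hand and natural for an agent that can wander.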

If you’re an everyday professional, it’s tempting to watch this and file it under ‘cool research demo.’ But the reason this matters is more mundane: a competent computer-use model is a universal adapter. It doesn’t need every tool to have an API. It doesn’t need your company’s internal software to ship an SDK. It can, in principle, operate the same interfaces you operate. That’s the difference between AI as a suggestion engine and AI as an operator.

There’s a particularly important shift embedded here: video as the training substrate. Screenshot agents are like reading a flipbook with half the pages missing. You lose the microstructure of interaction—hover states, cursor trajectories, subtle UI feedback, the difference between a drag and a click-and-hold. A model trained directly on video has a chance of learning the ‘feel’ of interface control in the way humans do. That’s the part that could make these systems usable for workflows that aren’t easily discretized into a handful of click actions.

Now for the part everyone should say out loud: this doesn’t mean your laptop is about to get a magical autopilot. The gap between a demo and a reliable coworker is mostly made of unglamorous constraints: safety boundaries, permissioning, recoverability, and the ability to ask for help instead of confidently wrecking state. But if you want a practical way to think about the trajectory, it’s this: the winning computer-use agent won’t be the one that can do a thousand tasks once. It will be the one that can do ten common tasks, every time, and gracefully fail when it can’t.
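Those unglamorous constraints are mostly plumbing you can sketch today. A minimal approval gate, with entirely illustrative names, might look like this: destructive actions require an explicit human yes, and low confidence escalates instead of acting:

```python
# Hypothetical guardrail around agent actions; not from any real
# agent framework. 'approve' is a callable the human controls.
DESTRUCTIVE = {"delete_record", "send_email", "submit_wire"}

def run_action(action, confidence, approve):
    """Gate one agent step: escalate when unsure, block destructive
    actions without approval, execute everything else."""
    if confidence < 0.8:
        return "escalated: asking the human for help"
    if action in DESTRUCTIVE and not approve(action):
        return f"blocked: {action} not approved"
    return f"executed: {action}"

print(run_action("refresh", 0.95, approve=lambda a: False))
# executed: refresh
print(run_action("send_email", 0.95, approve=lambda a: False))
# blocked: send_email not approved
print(run_action("send_email", 0.50, approve=lambda a: True))
# escalated: asking the human for help
```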

So what do you do with this today, besides feel like the future is arriving early? You watch for where ‘computer use’ becomes a feature, not a product. UI testing teams will care because fuzzing and exploratory testing are expensive and under-automated. CAD and design workflows will care because so much of the work is spatial and continuous, not neatly ‘callable’ through an API. Operations teams will care because the long tail of internal tools—the portal someone built in 2017 that still runs revenue ops—suddenly becomes automatable without rewriting it.

And you personally should care because the interface you’ve been optimizing your career around is about to change. If your job is partly ‘I know how to make the software do the thing,’ then the scarce skill becomes higher-level: choosing what the thing should be, verifying outputs, designing guardrails, and building feedback loops so the operator gets better over time.

The internet trained language models on how we write. Video-trained action models are an attempt to train on how we do. If FDM-1’s approach scales the way its authors believe, the next wave of AI won’t just draft the email. It’ll open the CRM, pull the context, update the record, and tee up the send—then wait for you to say yes.

Frequently Asked

What is FDM-1, in plain English?

It’s a model trained to take actions on a computer by learning from video of screens (and inferred mouse/keyboard actions), aiming to handle longer, multi-step workflows than screenshot-based agents typically can.

Why does training on video matter?

Video contains the continuity of real interaction—cursor motion, timing, and interface feedback—so a model can learn control and long-horizon context instead of reacting to isolated screenshots.

What should operators watch for next?

Computer-use capabilities will increasingly ship as features inside existing tools (testing, ops, design) with safety and approval gates—not as standalone ‘agent’ toys.
