Tencent YouTu Lab Research Paper

From Chatbot to Digital Colleague

The Paradigm Shift Toward Persistent Autonomous AI

A survey of how large language models are evolving from conversational answer generators into integrated AI systems for reasoning, action, memory, and self-improvement.

Yongheng Zhang*, Ziang Liu*, Jiaxuan Zhu*, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong,
Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

* Equal contribution Corresponding author

Tencent Youtu Lab, Tsinghua University, Sun Yat-sen University,
Central South University, University of Illinois at Chicago

Overview of the transition from conversational chatbots to persistent digital colleagues.

Overview

Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems for reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs evolve from Chatbot-era "fast thinking" systems driven by next-token prediction to Thinking LLMs that use inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning for deliberate cognition. Second, at the tool-augmented task execution level, LLMs progress from fragile Agents that invoke tools to OpenClaw-style workstation systems with persistent Workspaces, skills, verification loops, and governance. The proposed "Workspace + Skill" paradigm turns episodic tool use into colleague-like work, enabling state persistence, reusable procedures, task closure, and experience reuse. Finally, we examine the shift in data construction from instruction-response pairs to State-Action-Observation trajectories, and the shift in evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

Key Contributions

01

Two-Dimensional View

Cognitive-core evolution and tool-augmented task execution organize the field's progress.

02

Digital Colleague Shift

Workspace + Skill explains how chatbot-style interaction becomes persistent digital work.

03

Task Closure

State, reusable procedures, verification, recoverability, and governance become system-level requirements.

04

Auditable AI Ecosystems

Data and evaluation move toward trajectories, final-state evidence, and sandboxed task completion.

Roadmap

Roadmap of Next-Generation AI Systems

The paper organizes AI systems across cognitive-core evolution and tool-augmented task execution: from chatbots, to thinking LLMs, to agents, and then to persistent workspace systems.

Roadmap and evolutionary timeline of next-generation AI systems.
A roadmap and evolutionary timeline showing how AI systems progress from chatbots to reasoning cores, tool-using agents, and persistent workspace systems.
Time horizon growth of frontier AI agents.
Time horizon growth shows the movement from short response generation toward longer, more complex, and sustained task completion.

Cognitive Core

From Fast Response to Deliberate Reasoning

The first axis concerns how the model itself thinks. Chatbot-era systems are fast, stateless generators. Thinking LLMs introduce slower cognition through inference-time computation, long traces, reflection, process supervision, and reinforcement learning.

Chatbot Era

In the Chatbot Era, a user asks a natural-language question, the model performs fast single-pass inference over compressed parametric knowledge, and the system returns a fluent response.

The Chatbot Era diagram.
The Chatbot Era: fast, stateless, single-pass response generation.

Thinking LLM Era

Thinking LLMs allocate additional computation at inference time. They explore alternatives, verify intermediate steps, and revise before returning an answer.

The Thinking LLM Era diagram.
The Thinking LLM Era: slow, reflective reasoning with longer chains and deeper search.
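
One simple way to spend extra computation at inference time is to sample several candidate answers and keep the most frequent one (self-consistency voting). The sketch below is illustrative only and is not the method of any particular model in this survey. `sample_answer` is a hypothetical stand-in for a model call, and `toy_sampler` is a deterministic stub so the example runs without a model.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]


def answer_with_voting(question: str,
                       sample_answer: Callable[[str, int], str],
                       num_samples: int = 5) -> str:
    """Sample several candidates and aggregate them by majority vote."""
    candidates = [sample_answer(question, i) for i in range(num_samples)]
    return majority_vote(candidates)


def toy_sampler(question: str, seed: int) -> str:
    # Hypothetical stand-in for a model call: wrong on every third sample.
    return "41" if seed % 3 == 2 else "42"


if __name__ == "__main__":
    print(answer_with_voting("What is 6 * 7?", toy_sampler))
```

More samples cost more inference-time compute but make a single faulty reasoning path less likely to decide the final answer.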

Representative-model tables summarize the transition from non-reasoning chatbot-era LLMs to reasoning-oriented Thinking LLMs.

Representative Chatbot-Era LLMs
Model | Release | Parameters | Type | Access | Model | Release | Parameters | Type | Access
GPT-1 | 2018-06 | 117M | Text | Open | Orca | 2023-06 | 13B | Text | Closed
GPT-2 | 2019-02 | 1.5B | Text | Open | Llama 2 | 2023-07 | 7–70B | Text | Open
PLATO | 2019-10 | 132M | Text | Open | InternLM | 2023-07 | 7B/20B | Text | Open
T5 | 2019-10 | 60M–11B | Text | Open | Claude 2 | 2023-07 | - | Text | Closed
DialoGPT | 2019-11 | 117M/345M/762M | Text | Open | WizardLM | 2023-07 | 7B/13B/30B | Text | Open
Meena | 2020-01 | 2.6B | Text | Closed | Qwen | 2023-08 | 1.8–72B | Text | Open
BlenderBot | 2020-04 | 90M/2.7B/9.4B | Text | Open | Qwen-VL | 2023-08 | 9.6B | Multi | Open
GPT-3 | 2020-05 | 175B | Text | Closed | OpenFlamingo | 2023-08 | 3–9B | Multi | Open
PLATO-2 | 2020-06 | 93M/314M/1.6B | Text | Open | Code Llama-Instruct | 2023-08 | 7B/13B/34B | Code | Open
BlenderBot 2 | 2021-07 | 400M/2.7B | Text | Open | WizardMath | 2023-08 | 7B/13B/70B | Text | Open
Jurassic-1 | 2021-08 | 178B | Text | Closed | WizardCoder | 2023-08 | 7B/13B/34B | Code | Open
Codex | 2021-08 | 12M-12B | Text | Closed | IDEFICS | 2023-08 | 9B/80B | Multi | Open
HyperCLOVA | 2021-09 | 82B | Text | Closed | Phi-1.5 | 2023-09 | 1.3B | Text | Open
PLATO-XL | 2021-09 | 11B | Text | Open | Baichuan 2 | 2023-09 | 7B/13B | Text | Open
Gopher | 2021-12 | 280B | Text | Closed | GPT-4V | 2023-09 | - | Multi | Closed
ERNIE 3.0 Titan | 2021-12 | 260B | Text | Closed | Mistral 7B | 2023-09 | 7.3B | Text | Open
GLaM | 2021-12 | 1.2T-A97B | Text | Closed | Mixtral | 2023-09 | 7B | Text | Open
LaMDA | 2022-01 | 137B | Text | Closed | Kimi / Moonshot | 2023-10 | - | Text | Closed
AlphaCode | 2022-02 | 9B/41B | Text | Closed | ERNIE 4.0 | 2023-10 | - | Multi | Closed
InstructGPT | 2022-03 | 1.3–175B | Text | Closed | Fuyu | 2023-10 | 8B | Multi | Open
Chinchilla | 2022-03 | 70B | Text | Closed | Zephyr-7B | 2023-10 | 7B | Text | Open
CodeGen | 2022-03 | 350M/2B/6B/16B | Text | Open | ChatGLM3-6B | 2023-10 | 6B | Text | Open
PaLM | 2022-04 | 540B | Text | Closed | Skywork-13B | 2023-10 | 13B | Text | Open
Flamingo | 2022-04 | 80B | Multi | Closed | GPT-4 Turbo | 2023-11 | - | Text | Closed
OPT | 2022-05 | 125M–175B | Text | Open | Grok-1 | 2023-11 | 314B-A78.5B | Text | Open
GODEL | 2022-06 | 220M/770M | Text | Open | Yi | 2023-11 | 6B/34B | Text | Open
BLOOM | 2022-07 | 176B | Text | Open | CogVLM | 2023-11 | 17B | Multi | Open
BlenderBot 3 | 2022-08 | 3B/30B/175B | Text | Open | Claude 2.1 | 2023-11 | - | Text | Closed
PaLI / PaLI-X | 2022-09 | 17B/55B | Multi | Closed | Inflection-2 | 2023-11 | - | Text | Closed
Sparrow | 2022-09 | 70B | Text | Closed | DeepSeek Coder Instruct | 2023-11 | 1B–33B | Code | Open
CodeGeeX | 2022-09 | 13B | Text | Open | OpenChat 3.5 | 2023-11 | 7B | Text | Open
GLM-130B | 2022-10 | 130B | Text | Open | DeepSeek LLM | 2023-11 | 7B/67B | Text | Open
Galactica | 2022-11 | 120B | Text | Open | Orca 2 | 2023-11 | 7B/13B | Text | Open
BLIP-2 | 2023-01 | 4B-12B | Multi | Open | Mixtral 8x7B | 2023-12 | 47B-A13B | Text | Open
Llama | 2023-02 | 7–65B | Text | Open | Phi-2 | 2023-12 | 2.7B | Text | Open
Alpaca | 2023-03 | 7B | Text | Open | Gemini 1.0 | 2023-12 | - | Multi | Closed
Claude 1 | 2023-03 | - | Text | Closed | InternVL 1.0 | 2023-12 | 6B+ | Multi | Open
PanGu-Sigma | 2023-03 | 1.085T | Text | Closed | SOLAR-10.7B | 2023-12 | 10.7B | Text | Open
BloombergGPT | 2023-03 | 50B | Text | Closed | GLM-4 | 2024-01 | - | Text | Closed
ChatGLM-6B | 2023-03 | ~6.2B | Text | Open | GLM-4V | 2024-01 | 9B | Multi | Closed
GPT-4 | 2023-03 | - | Multi | Closed | LLaVA-NeXT | 2024-01 | 7–34B | Multi | Open
PaLM-E | 2023-03 | 562B | Multi | Closed | Stable LM 2 | 2024-01 | 1.6B | Text | Open
Vicuna | 2023-03 | 7B/13B | Text | Open | Yi-VL | 2024-01 | 6B/34B | Multi | Open
GPT-3.5 Turbo | 2023-03 | - | Text | Closed | Mistral Large | 2024-02 | - | Text | Closed
Pythia | 2023-04 | 70M–12B | Text | Open | Qwen 1.5 | 2024-02 | 0.5–72B | Text | Open
LLaVA | 2023-04 | 7B/13B | Multi | Open | Gemini 1.5 | 2024-02 | - | Multi | Closed
MiniGPT-4 | 2023-04 | 7B/13B | Multi | Open | OLMo | 2024-02 | 1B/7B | Text | Open
Dolly 2.0 | 2023-04 | 12B | Text | Open | StarCoder2 | 2024-02 | 3B/7B/15B | Code | Open
Stable LM | 2023-04 | 3B/7B | Text | Open | Reka Flash | 2024-02 | 21B | Multi | Closed
Falcon | 2023-05 | 7–180B | Text | Open | Gemma | 2024-03 | 2B/7B | Text | Open
MPT | 2023-05 | 7B/30B | Text | Open | Qwen1.5-MoE | 2024-03 | 14B-A2.7B | Text | Open
StarCoder | 2023-05 | 15.5B | Text | Open | DBRX | 2024-03 | 132B-A36B | Text | Open
RedPajama | 2023-05 | 3B/7B | Text | Open | Jamba | 2024-03 | 52B-A12B | Text | Open
InstructBLIP | 2023-05 | - | Multi | Open | Claude 3 | 2024-03 | - | Multi | Closed
PaLM 2 | 2023-05 | - | Text | Closed | Command R | 2024-03 | 35B | Text | Open
CodeT5+ | 2023-05 | 220M-16B | Code | Open | Inflection-2.5 | 2024-03 | - | Text | Closed
Inflection-1 | 2023-06 | - | Text | Closed | DeepSeek-VL | 2024-03 | 1.3B/7B | Multi | Open
Phi-1 | 2023-06 | 1.3B | Text | Open | Grok-1.5 | 2024-03 | - | Text | Closed
Aquila | 2023-06 | 7B/33B | Text | Open | MM1 | 2024-03 | 3B/7B/30B | Multi | Closed
ChatGLM2-6B | 2023-06 | 6B | Text | Open | Yi-9B | 2024-03 | 9B | Text | Open
Baichuan-Chat | 2023-06 | 7B/13B | Text | Open | Phi-3 | 2024-04 | 3.8–14B | Text | Open
XGen-7B | 2023-06 | 7B | Text | Open | Mixtral 8x22B | 2024-04 | 141B-A39B | Text | Open
Llama 3 | 2024-04 | 8B-A70B | Text | Open | Llama-3.1-Nemotron-70B | 2024-11 | 70B | Text | Open
Command R+ | 2024-04 | 104B | Text | Open | Hunyuan-Large | 2024-11 | 389B-A52B | Text | Open
InternVL 1.5 | 2024-04 | 26B | Multi | Open | OLMo 2 | 2024-11 | 7B/13B | Text | Open
Reka Core | 2024-04 | - | Multi | Closed | Pixtral Large | 2024-11 | 124B | Multi | Mixed
CodeQwen1.5 | 2024-04 | 7B | Code | Open | SmolVLM | 2024-11 | 2B | Multi | Open
IDEFICS2 | 2024-04 | 8B | Multi | Open | DeepSeek-V3 | 2024-12 | 671B-A37B | Text | Open
OpenELM | 2024-04 | 270M-3B | Text | Open | Llama 3.3 | 2024-12 | 70B | Text | Open
Snowflake Arctic | 2024-04 | 480B-A17B | Text | Open | PaliGemma 2 | 2024-12 | 3B/10B/28B | Multi | Open
Doubao | 2024-05 | - | Text | Closed | DeepSeek-VL2 | 2024-12 | 27B-A4.5B | Multi | Open
DeepSeek-V2 | 2024-05 | 236B-A21B | Text | Open | Falcon 3 | 2024-12 | 1B-10B | Text | Open
GPT-4o | 2024-05 | - | Multi | Closed | Granite 3.1 | 2024-12 | 1B-8B | Text | Open
CogVLM2 | 2024-05 | 19B | Multi | Open | InternVL2.5 | 2024-12 | 1B–78B | Multi | Open
MiniCPM-V | 2024-05 | 2–8B | Multi | Open | MiniMax-Text-01 | 2025-01 | 456B-A45.9B | Text | Open
Codestral | 2024-05 | 22B | Code | Open | MiniMax-VL-01 | 2025-01 | 456B-A45.9B | Multi | Open
Falcon 2 | 2024-05 | 11B | Multi | Open | Qwen2.5-Max | 2025-01 | - | Text | Closed
PaliGemma | 2024-05 | 3B | Multi | Open | MiniCPM-o 2.6 | 2025-01 | 8B | Multi | Open
Aya 23 | 2024-05 | 8B/35B | Text | Open | Qwen2.5-VL | 2025-01 | 3B/7B/72B | Multi | Open
Granite Code | 2024-05 | 3B-34B | Code | Open | Janus-Pro | 2025-01 | 1B/7B | Multi | Open
Qwen 2 | 2024-06 | 0.5–72B | Text | Open | Mistral Small 3 | 2025-01 | 24B | Text | Open
GLM-4-9B | 2024-06 | 9B | Text | Open | GPT-4.5 | 2025-02 | - | Text | Closed
Claude 3.5 Sonnet | 2024-06 | - | Multi | Closed | Phi-4-mini | 2025-02 | 4B | Text | Open
Cambrian-1 | 2024-06 | 3–34B | Multi | Open | Phi-4-multimodal | 2025-02 | 6B | Multi | Open
DeepSeek-Coder-V2 | 2024-06 | 236B-A21B | Code | Open | Command A | 2025-03 | 111B | Text | Closed
Nemotron-4 | 2024-06 | 340B | Text | Open | Mistral Small 3.1 | 2025-03 | 24B | Multi | Open
Gemma 2 | 2024-06 | 2B/9B/27B | Text | Open | Aya Vision | 2025-03 | 8B/32B | Multi | Open
Skywork-MoE | 2024-06 | 146B-A22B | Text | Open | Qwen2.5-VL-32B | 2025-03 | 32B | Multi | Open
InternVL 2.0 | 2024-07 | 1–76B | Multi | Open | OLMo 2 32B | 2025-03 | 32B | Text | Open
Llama 3.1 | 2024-07 | 8–405B | Text | Open | GPT-4.1 | 2025-04 | - | Multi | Closed
InternLM 2.5 | 2024-07 | 1.8–20B | Text | Open | GPT-4.1 mini | 2025-04 | - | Multi | Closed
GPT-4o mini | 2024-07 | - | Multi | Closed | GPT-4.1 nano | 2025-04 | - | Multi | Closed
Codestral Mamba | 2024-07 | 7B | Code | Open | Granite 3.3 | 2025-04 | 2B/8B | Text | Open
Mistral NeMo | 2024-07 | 12B | Text | Open | Kimi-VL-A3B-Instruct | 2025-04 | 16B-A2.8B | Multi | Open
SmolLM | 2024-07 | 135M/360M/1.7B | Text | Open | Amazon Nova Premier | 2025-04 | - | Multi | Closed
Mistral Large 2 | 2024-07 | 123B | Text | Open | Mistral Medium 3 | 2025-05 | - | Multi | Closed
LLaVA-OV | 2024-08 | 0.5–72B | Multi | Open | Devstral | 2025-05 | 24B | Code | Open
Grok-2 | 2024-08 | - | Text | Closed | ERNIE-4.5-300B-A47B | 2025-06 | 300B-A47B | Multi | Open
Grok-1.5V | 2024-08 | - | Multi | Closed | Qwen3-4B-Instruct | 2025-07 | 4B | Text | Open
Phi-3.5-mini-instruct | 2024-08 | 3.8B | Text | Open | Kimi K2 Instruct | 2025-07 | 1T-A32B | Text | Open
Phi-3.5-MoE-instruct | 2024-08 | 42B-A6.6B | Multi | Open | Qwen3-Coder | 2025-07 | 480B-A35B | Text | Open
Jamba 1.5 | 2024-08 | 398B-A94B | Text | Open | FastVLM | 2025-07 | 0.5B/1.5B/7B | Multi | Open
Qwen2-VL | 2024-09 | 2–72B | Multi | Open | LFM2-VL | 2025-08 | 450M/1.6B/3B | Multi | Open
Llama 3.2 Text | 2024-09 | 1B/3B | Text | Open | LongCat-Flash-Chat | 2025-08 | 560B-A27B | Multi | Open
Llama 3.2 Vision | 2024-09 | 11B/90B | Multi | Open | Qwen3-Next | 2025-09 | 81B-A3B | Text | Open
Qwen2.5 | 2024-09 | 0.5–72B | Text | Open | Qwen3-VL | 2025-09 | 235B-A22B | Multi | Open
Pixtral | 2024-09 | 12B | Multi | Open | Mistral Large 3 | 2025-12 | 675B-A41B | Multi | Open
OLMoE | 2024-09 | 7B-A1B | Text | Open | Ministral 3 Instruct | 2025-12 | 3B/8B/14B | Multi | Open
Molmo | 2024-09 | 7B/72B | Multi | Open | Devstral 2 | 2025-12 | 123B | Text | Open
Claude 3.5 Haiku | 2024-10 | - | Text | Closed | Devstral Small 2 | 2025-12 | 24B | Text | Open
Aya Expanse | 2024-10 | 8B/32B | Text | Open | Qwen3-Coder-Next | 2026-02 | 80B-A3B | Text | Open
Granite 3.0 | 2024-10 | 1B-8B | Text | Open | LongCat-Flash-Lite | 2026-02 | 68.5B-A3B | Text | Open
Yi-Lightning | 2024-10 | - | Text | Closed | Mistral Small 4-instruct | 2026-03 | 119B-A6B | Multi | Open
Qwen2.5-Coder | 2024-11 | 0.5–32B | Text | Open | LongCat-Next | 2026-03 | 68.5B-A3B | Multi | Open

Representative Thinking LLMs
Model | Release | Parameters | Type | Access | Model | Release | Parameters | Type | Access
o1-preview | 2024-09 | - | Text | Closed | Claude 4 Opus | 2025-05 | - | Multi | Closed
o1-mini | 2024-09 | - | Text | Closed | MiniMax-M1 | 2025-06 | 456B/46B | Text | Open
Marco-o1 | 2024-11 | 7B | Text | Open | Kimi-Dev-72B | 2025-06 | 72B | Code | Open
QwQ-32B-Preview | 2024-11 | 32B | Text | Open | MiMo-VL-7B | 2025-06 | 7B | Multi | Open
Skywork-o1 Open | 2024-11 | 8B | Text | Open | Hunyuan-A13B-Instruct | 2025-06 | 80B-A13B | Text | Open
o1 | 2024-12 | - | Text | Closed | Kimi K2 | 2025-07 | 1T/32B | Multi | Open
o1-pro | 2024-12 | - | Text | Closed | Qwen3-Coder | 2025-07 | 480B/35B | Code | Open
Gemini 2.0 Flash Thinking | 2024-12 | - | Multi | Closed | Qwen3-235B-Thinking-2507 | 2025-07 | 235B/22B | Text | Open
QVQ-72B-Preview | 2024-12 | 72B | Multi | Open | Grok 4 | 2025-07 | - | Multi | Closed
DeepSeek-R1-Zero | 2025-01 | 671B/37B | Text | Open | SmolLM3 | 2025-07 | 3B | Text | Open
DeepSeek-R1 | 2025-01 | 671B/37B | Text | Open | GPT-5 | 2025-08 | ~300B | Multi | Closed
R1-Distill-Qwen | 2025-01 | 1.5B–32B | Text | Open | DeepSeek-V3.1 | 2025-08 | 685B/37B | Text | Open
R1-Distill-Llama | 2025-01 | 8B/70B | Text | Open | GPT-oss-120B | 2025-08 | 117B/5.1B | Text | Open
Kimi k1.5 | 2025-01 | - | Multi | Closed | GPT-oss-20B | 2025-08 | 20B | Text | Open
Sky-T1-32B | 2025-01 | 32B | Text | Open | Claude Opus 4.1 | 2025-08 | - | Multi | Closed
o3-mini | 2025-01 | - | Text | Closed | ERNIE 4.5-Thinking | 2025-09 | 21B/3B | Text | Open
s1 | 2025-02 | 32B | Text | Open | Claude Sonnet 4.5 | 2025-09 | - | Multi | Closed
LIMO | 2025-02 | 32B | Text | Open | Grok 4 Fast | 2025-09 | - | Multi | Closed
Grok 3 | 2025-02 | - | Multi | Closed | MiniMax-M2 | 2025-10 | 230B/10B | Multi | Open
Grok 3 mini | 2025-02 | - | Text | Closed | Claude Haiku 4.5 | 2025-10 | - | Multi | Closed
Claude 3.7 Sonnet | 2025-02 | - | Multi | Closed | Grok 4.1 Fast | 2025-10 | - | Multi | Closed
Hunyuan-T1-Preview | 2025-02 | - | Text | Closed | Ring-1T | 2025-10 | 1T-A50B | Text | Open
Open-Reasoner-Zero | 2025-02 | 7B/32B | Text | Open | GPT-5.1 | 2025-11 | - | Multi | Closed
TinyZero | 2025-02 | 3B | Text | Open | Gemini 3 Pro | 2025-11 | - | Multi | Closed
Eurus-2-PRIME | 2025-02 | 7B | Text | Open | Grok 4.1 | 2025-11 | - | Multi | Closed
Bespoke-Stratos | 2025-02 | 7B | Text | Open | Claude Opus 4.5 | 2025-11 | - | Multi | Closed
Light-R1 | 2025-02 | 7B/14B | Text | Open | DeepSeek-V3.2 | 2025-12 | 671B/37B | Text | Open
Hunyuan-TurboS | 2025-02 | 560B-A56B | Text | Closed | DeepSeek-V3.2-Speciale | 2025-12 | 671B/37B | Text | Open
Gemma 3 | 2025-03 | 4B/12B/27B | Multi | Open | Gemini 3 Flash | 2025-12 | - | Multi | Closed
QwQ-32B | 2025-03 | 32B | Text | Open | MiMo-V2-Flash | 2025-12 | 309B/15B | Text | Open
Hunyuan-T1 | 2025-03 | - | Text | Closed | GLM-4.7 | 2025-12 | 358B | Text | Open
Gemini 2.5 Pro | 2025-03 | - | Multi | Closed | Devstral 2 | 2025-12 | 123B | Code | Open
DeepSeek-V3-0324 | 2025-03 | 671B/37B | Text | Open | GPT-5.2 | 2025-12 | - | Multi | Closed
Phi-4-reasoning | 2025-04 | 14B | Text | Open | LongCat-Flash-Thinking-2601 | 2026-01 | 560B-A27B | Text | Open
Phi-4-reasoning-plus | 2025-04 | 14B | Text | Open | Step 3.5 Flash | 2026-02 | - | Text | Open
Qwen3 | 2025-04 | 0.6B–235B | Text | Open | Kimi K2.5 | 2026-02 | 1T/32B | Multi | Open
o3 | 2025-04 | - | Multi | Closed | Qwen3.5 | 2026-02 | 397B/17B | Multi | Open
o4-mini | 2025-04 | - | Multi | Closed | Gemini 3.1 Pro | 2026-02 | - | Multi | Closed
Kimi-VL-A3B-Thinking | 2025-04 | 2.8B act. | Multi | Open | GPT-5.3-Codex | 2026-02 | - | Code | Closed
GLM-Z1-32B | 2025-04 | 32B | Text | Open | Claude Opus 4.6 | 2026-02 | - | Multi | Closed
Z1-Rumination-32B | 2025-04 | 32B | Text | Open | MiniMax-M2.5 | 2026-02 | 230B-A10B | Text | Open
GLM-Z1-9B | 2025-04 | 9B | Text | Open | GPT-5.4 | 2026-03 | - | Multi | Closed
Llama 4 Maverick | 2025-04 | 400B/17B | Multi | Open | Nemotron-Cascade-2 | 2026-03 | 30B/3B | Code | Open
Llama 4 Scout | 2025-04 | 109B/17B | Multi | Open | GPT-5.3 | 2026-03 | - | Multi | Closed
Seed-Thinking-v1.5 | 2025-04 | - | Text | Open | MiniMax-M2.7 | 2026-03 | 230B-A10B | Text | Open
Nemotron-Ultra-253B | 2025-04 | 253B/17B | Text | Open | MiMo-V2.5-Pro | 2026-04 | - | Multi | Open
ERNIE-4.5-VL | 2025-04 | 424B-A47B | Multi | Open | Kimi K2.6 | 2026-04 | 1T/32B | Multi | Open
Codex-1 | 2025-05 | - | Code | Closed | GLM-5.1 | 2026-04 | 754B | Text | Open
DeepSeek-R1-0528 | 2025-05 | 671B/37B | Text | Open | DeepSeek-V4 | 2026-04 | 1.6T | Text | Open
R1-Distill-Qwen3-8B | 2025-05 | 8B | Text | Open | Qwen3.6 | 2026-04 | 35B/3B+ | Multi | Open
R1-Distill-Qwen3-32B | 2025-05 | 32B | Text | Open | Gemma 4 | 2026-04 | 2B–26B | Multi | Open
MiMo-7B-RL | 2025-05 | 7B | Text | Open | GPT-5.5 | 2026-04 | - | Multi | Closed
MiMo-7B-RL-0530 | 2025-05 | 7B | Text | Open | Claude Opus 4.7 | 2026-04 | - | Multi | Closed
Doubao 1.5 Pro Thinking | 2025-05 | - | Text | Closed | Claude Mythos Preview | 2026-04 | - | Multi | Closed
Gemini 2.5 Flash | 2025-05 | - | Multi | Closed | Grok 4.3 | 2026-05 | - | Multi | Closed
InternVL3 | 2025-05 | 2B–78B | Multi | Open | Ring-2.6-1T | 2026-05 | 1T-A63B | Text | Open
Devstral | 2025-05 | 24B | Code | Open | ERNIE 5.1 | 2026-05 | - | Multi | Closed
Claude 4 Sonnet | 2025-05 | - | Multi | Closed | Claude Opus 4.8 | 2026-05 | - | Multi | Closed

Task Execution

From Agents to OpenClaw

The second axis asks how a stronger cognitive core becomes a system that can act. The Agent Era introduces environment-action-feedback loops. The OpenClaw Era embeds action inside persistent workspaces.

Agent Era

Agent systems break from single-turn answering by allowing LLMs to plan, call tools, browse, write code, manipulate files, and react to observations.

The Agent Era diagram.
The Agent Era: observe, think, act, receive feedback, and iterate toward a task goal.
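
The observe, think, act, feedback cycle can be sketched as a small loop. This is an illustrative sketch, not the architecture of any specific agent framework: `policy` is a hypothetical stand-in for the LLM's decision step, `tools` is an assumed name-to-function mapping, and `toy_policy` is a rule-based stub so the loop runs without a model.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, tool argument)


def run_agent(goal: str,
              policy: Callable[[str, List[str]], Action],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> List[str]:
    """Iterate think -> act -> observe until the policy emits 'finish'."""
    observations: List[str] = []
    for _ in range(max_steps):
        tool_name, argument = policy(goal, observations)      # think
        if tool_name == "finish":
            observations.append(f"final: {argument}")
            break
        if tool_name not in tools:
            # Feed the failure back so the policy can react on the next step.
            observations.append(f"error: unknown tool {tool_name}")
            continue
        observations.append(tools[tool_name](argument))        # act + observe
    return observations


def toy_policy(goal: str, observations: List[str]) -> Action:
    # Hypothetical rule-based stand-in for an LLM: look up once, then finish.
    if not observations:
        return ("lookup", goal)
    return ("finish", observations[-1])


toy_tools = {"lookup": lambda query: f"result for {query}"}

if __name__ == "__main__":
    print(run_agent("release date", toy_policy, toy_tools))
```

Note that all state lives in the in-memory `observations` list and disappears when the function returns, which is the episodic limitation that workspace-based systems are meant to address.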

OpenClaw Era

OpenClaw-style systems treat the workspace as the host of work, turning actions into inspectable operations over files, terminals, browsers, services, and skills.

The OpenClaw Era diagram.
The OpenClaw Era: persistent workspaces, reusable skills, verification loops, and governed task closure.

Agent Era vs. OpenClaw Era

The boundary table clarifies why OpenClaw is not just another agent loop: the organizing abstraction changes from external tool interaction to persistent workspace-based task hosting.

Dimension | Agent Era | OpenClaw Era
Organizing abstraction | Environment-action-feedback loop | Persistent workspace for task hosting
Unit of action | Tool call or external API invocation | Workspace operation over files, terminals, browsers, services, and skills
State model | Episodic observations and short-horizon memory | Durable files, sessions, logs, repositories, local memory, and snapshots
Knowledge reuse | Prompt patterns, retrieved memory, or ad-hoc demonstrations | Reusable skill packages with instructions, scripts, dependencies, examples, and checks
Evaluation focus | Action correctness and trajectory success rate | Task closure, final-state verification, repeatability, and auditability
Safety boundary | Prompt-level guardrails and tool-use policies | Runtime permissions, provenance tracking, audit logs, and governance over workspace changes

Representative Agent Works
Work | Year | Category | Role | Key Contribution
ReAct | 2022 | Agent Architecture | Agent Framework | Thought-Action-Observation loop
HuggingGPT | 2023 | Perception | Agent System | LLM orchestration of Hugging Face models
Reflexion | 2023 | Planning | Agent Framework | Language reflections on failed attempts
Generative Agents | 2023 | Memory | Agent System | Memory stream with reflections
Toolformer | 2023 | Tool Use | Agent Model | Self-supervised tool-call learning
WebArena | 2023 | Benchmark | Evaluation | Realistic web task environments

Workspace + Skill

The Mechanism Behind Digital Colleagues

Workspace + Skill turns episodic interaction into durable digital work. A workspace provides state, evidence, recoverability, and consequences. A skill provides reusable operational knowledge.

Simple tool invocation diagram.
Simple tool invocation helps with local sub-tasks, but isolated calls are not enough for long-horizon work.
Workspace and Skill paradigm diagram.
The Workspace + Skill paradigm combines persistent environments with reusable procedures.
Representative OpenClaw-Era Works
Work | Year | Category | Role | Key Contribution
OpenClaw | 2026 | Workspace | Agent Framework | Persistent workspace with tools, channels, and skills
OpenHands | 2024 | Workspace | Agent Platform | Code editing, shell execution, and browsing in controlled environments
SWE-agent | 2024 | Workspace | Agent-Computer Interface | Repository navigation and test-feedback interface for agents
Voyager | 2023 | Skill | Agent System | Executable skill library learned from environment feedback
Anthropic Agent Skills | 2026 | Skill | Skill Infrastructure | Folder-based skills with instructions, scripts, and resources
ClawGuard | 2026 | Governance | Runtime Guardrail | File, command, network, and skill boundary enforcement
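
To make the Workspace + Skill idea concrete, the sketch below pairs a directory-backed workspace (persistent state plus an append-only log) with a skill that bundles a procedure and a check on its result. It is a minimal illustration under stated assumptions, not OpenClaw's actual interface: the `Workspace`, `Skill`, and `apply_skill` names and the `state.json` / `actions.log` file layout are all hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


class Workspace:
    """A directory that persists state and an append-only action log across sessions."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.state_path = self.root / "state.json"   # hypothetical file layout
        self.log_path = self.root / "actions.log"

    def load_state(self) -> dict:
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())
        return {}

    def save_state(self, state: dict) -> None:
        self.state_path.write_text(json.dumps(state, sort_keys=True))

    def log(self, message: str) -> None:
        with self.log_path.open("a") as handle:
            handle.write(message + "\n")


@dataclass
class Skill:
    """A reusable procedure bundled with a check on the resulting state."""
    name: str
    run: Callable[[dict], dict]
    check: Callable[[dict], bool]


def apply_skill(workspace: Workspace, skill: Skill) -> bool:
    """Run a skill against persisted state; commit only if its check passes."""
    new_state = skill.run(dict(workspace.load_state()))
    passed = skill.check(new_state)
    if passed:
        workspace.save_state(new_state)
    workspace.log(f"{skill.name}: {'ok' if passed else 'rejected'}")
    return passed


def demo() -> List[str]:
    """Three sessions share one workspace; the third run fails its check."""
    bump = Skill(
        name="bump_counter",
        run=lambda state: {**state, "count": state.get("count", 0) + 1},
        check=lambda state: state["count"] <= 2,
    )
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "workspace"
        apply_skill(Workspace(root), bump)   # session 1
        apply_skill(Workspace(root), bump)   # session 2 resumes persisted state
        apply_skill(Workspace(root), bump)   # rejected: count would become 3
        final = Workspace(root)
        return [json.dumps(final.load_state())] + final.log_path.read_text().splitlines()


if __name__ == "__main__":
    print(demo())
```

Each `Workspace(root)` call models a fresh session: nothing is carried over in memory, yet the counter and the log survive because they live in the workspace, and the rejected run leaves the committed state untouched.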

Data & Eval

From Static Answers to Auditable Work

As AI systems move from answering questions to acting in workspaces, both data and evaluation must change. The unit of learning and measurement becomes the full state-action-observation trajectory and the verified final state.

Data paradigm shift diagram.
Data evolves from prompt-response pairs to reasoning traces and state-action-observation trajectories.
Evaluation paradigm shift diagram.
Evaluation shifts from final-answer correctness to process judgment and task closure.

Data Construction

Agent and OpenClaw data must capture tool outputs, UI states, file changes, terminal errors, permissions, workspace snapshots, skills, and final-state evidence.

Stage | Core Data Unit | Training / Supervision Signal | Representative Resources | Evaluation Focus
Chatbot | Static corpora and instruction–response pairs | Human demonstrations, preference comparisons, safety labels, and dialogue corrections | InstructGPT/RLHF, FLAN/T0, Self-Instruct and open SFT data | Answer correctness, fluency, helpfulness, preference alignment, and instruction following on mostly static inputs
Thinking LLM Era | Reasoning-process traces and intermediate solution paths | Chain-of-thought rationales, self-generated reasoning, step-wise verification, process rewards, and preference optimization | CoT / zero-shot CoT, Self-Consistency and ToT, PRM800K and Math-Shepherd, DeepSeek-R1 | Reliability of the reasoning path, verifiable math/code performance, step-level correctness, and robustness beyond final-answer accuracy
Agent Era | State–action–observation trajectories with tool feedback | Tool-call traces, API arguments, execution results, environment feedback, and multi-step recovery signals | Toolformer, API-Bank / Gorilla / ToolBench, WebArena and OSWorld | Task success in interactive environments, correct tool selection, argument generation, state tracking, and feedback-driven continuation
OpenClaw / Workspace Era | Workspace-level trajectories plus reusable skills and final-state evidence | File, shell, browser, UI, permission, snapshot, skill-package, and safety-policy traces with executable verification | SWE-bench, ToolSandbox, ClawsBench, ATBench-Claw and ClawSafety | End-to-end task closure, state verifiability, reproducibility, efficiency, rollback behavior, and trajectory-level safety
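
As a rough illustration of what a state-action-observation record might look like in practice, the sketch below stores one trajectory as a single JSON line. The schema is an assumption made for this example: the field names (`state`, `action`, `observation`, `final_state`) are hypothetical and do not come from any dataset listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    state: Dict[str, Any]    # what the system could see before acting
    action: Dict[str, Any]   # tool name and arguments
    observation: str         # tool output or error text


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    final_state: Dict[str, Any] = field(default_factory=dict)

    def add(self, state: Dict[str, Any], action: Dict[str, Any], observation: str) -> None:
        self.steps.append(Step(state, action, observation))

    def to_jsonl_record(self) -> str:
        """Serialize the whole trajectory as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    traj = Trajectory(task="make the test suite pass")
    traj.add({"cwd": "/repo"}, {"tool": "shell", "args": "pytest"}, "1 failed")
    traj.add({"cwd": "/repo"}, {"tool": "edit", "args": "fix off-by-one"}, "file saved")
    traj.final_state = {"tests_passed": True}
    print(traj.to_jsonl_record())
```

Unlike an instruction-response pair, the record keeps the failed intermediate observation and the final-state evidence, so both recovery behavior and task closure can be supervised or audited later.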

Evaluation

Next-generation systems must be assessed by reasoning validity, state changes, reliability, efficiency, reproducibility, safety, and task closure.

Stage | Evaluation Object | Core Metrics | Representative Benchmarks | Main Limitation
Final-Output Evaluation | Static answers, labels, generated text, or executable final outputs | Accuracy, exact match, BLEU/ROUGE, preference win rate, and Pass@1 | MMLU, GSM8K / MATH, GPQA / FrontierMath, BIG-Bench / HELM | Scores the endpoint but cannot reveal whether the model used a valid reasoning path or merely reached the right answer accidentally
Process-Level Evaluation | Reasoning traces, intermediate steps, critiques, and verification paths | Step correctness, judge preference, process-reward quality, consistency, and contamination resistance | Hard2Verify / DeltaBench, ProcessBench / PRMBench | Improves trace inspection but may rely on judge models, incomplete process labels, or reasoning that is not grounded in external state
Task-Closure Evaluation | Interactive trajectories and final workspace states after tool, web, file, or UI operations | Task success rate, final-state verification, tool-call efficiency, reliability, reproducibility, and trajectory-level safety | SWE-bench, WebArena, OSWorld, ToolSandbox / tau-bench | Requires executable environments, reproducible initial states, trajectory logs, replay mechanisms, and costly final-state checks
Workspace/OpenClaw Evaluation | Persistent workspaces with skills, permissions, snapshots, external services, and auditable action chains | Closure rate, unsafe-action rate, rollback behavior, skill stability, state diffs, auditability, and governance compliance | Claw-Eval, ClawBench, ClawsBench, ATBench-Claw, ClawSafety | Makes evaluation realistic but increases infrastructure cost, safety risk, simulator-design burden, and cross-run comparability challenges
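
Final-state verification can be illustrated with a toy check over workspace snapshots. The sketch below assumes a snapshot is a mapping from file path to content, and it assumes that "closure rate" means the fraction of runs whose final state passes verification; both are simplifications made for this example rather than definitions taken from the benchmarks above.

```python
from typing import Dict, List


def state_diff(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Summarize which paths were added, removed, or changed between snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(p for p in set(before) & set(after) if before[p] != after[p]),
    }


def task_closed(after: Dict[str, str], expected: Dict[str, str]) -> bool:
    """A task counts as closed only if every expected path has the expected content."""
    return all(after.get(path) == content for path, content in expected.items())


def closure_rate(results: List[bool]) -> float:
    """Fraction of runs whose final state passed verification (assumed definition)."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    before = {"a.txt": "1", "b.txt": "2"}
    after = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}
    print(state_diff(before, after))
    print(task_closed(after, {"b.txt": "3"}))
    print(closure_rate([True, False, True, True]))
```

The check inspects what the workspace actually contains rather than what the system reported, which is the difference between scoring an answer and scoring completed work.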

Evaluation Stages

Stage I: Final-Output Evaluation

Representative final-output benchmark scores across MMLU, GSM8K, MATH, HumanEval, and related metrics.

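Model | Base Model | MMLU | MMLU-Pro | GSM8K | MATH | MATH-500 | HumanEval
GPT-5.4 | - | 94.0 | 87.0 | 98.1 | 90.2 | - | 94.1
Claude Opus 4.6 | - | 92.1 | 82.5 | 97.8 | 91.5 | - | 92.4
Gemini 3.1 Pro | - | 92.6 | 91.2 | 94.2 | 85.3 | - | 87.6
DeepSeek-V4-Pro-Base | - | 90.1 | 73.5 | 92.6 | 64.5 | - | 76.8
Qwen3.7 Max | - | - | 89.6 | - | 94.6 | - | 92.4
GLM-5.1 | - | 89.0 | 86.0 | 95.3 | 83.4 | - | 88.6
SAGE-32B (Think) | Qwen2.5-32B | 90.20 | 79.27 | 96.74 | - | 91.8 | -
Warmup K&K | Qwen2.5-14B | - | 62.7 | - | - | 77.4 | -
AceMath-72B-Instruct | Qwen2.5-Math-72B-Instruct | - | - | 96.4 | 86.1 | - | -
PromptCoT-DS-7B | DeepSeek-R1-Distill-Qwen-7B | - | - | 92.6 | - | 93.0 | -
Nemotron-CrossThink-32B | Qwen2.5-32B | 83.60 | 69.43 | - | - | 84.00 | -
Introspective X Training | - | 50.9 | 27.9 | 59.5 | 46.5 | - | 54.9
CoT2-Meta | Claude Sonnet 4.5 | - | 88.4 | 98.6 | 92.8 | - | 72.8
Guideline Forest | GPT-4o-mini | - | - | 93.5 | - | 69.2 | 95.4
STOP-ECN | DeepSeek-R1-Distill-Qwen-7B | - | - | 91.1 | - | 86.8 | -
FastMCTS+Branch-DPO | FastMCTS-7B | - | - | 89.9 | 75.4 | - | -
FastMCTS | Qwen2.5-7B | - | - | 88.9 | 74.0 | - | -
Rejection Sampling | Qwen2.5-7B | - | - | 87.1 | 70.0 | - | -
SBS | DeepSeek-Math-7B-Base | - | - | 84.1 | 66.3 | - | -
MCTS | DeepSeek-Math-7B-Base | - | - | 83.2 | 64.0 | - | -
DeepSeekMath-7B-RL | DeepSeekMath-7B | - | - | 88.2 | 51.7 | - | -
SimPO | Qwen2.5-Math-7B-Instruct | - | - | 88.8 | 40.0 | 56.6 | -
Self-Explore | DeepSeek-Math-7B-Base | - | - | 78.6 | 37.7 | - | -
DeepSeek-Coder-V2-Instruct | - | - | - | 94.9 | 75.7 | - | 90.2
OMI2 (Full) | Qwen2.5-Coder-7B | - | - | 88.5 | 73.2 | - | -
CODEI/O | Qwen2.5-Coder-7B | - | - | 86.4 | 71.9 | - | -
PyEdu | Qwen2.5-Coder-7B | - | - | 85.8 | 71.4 | - | -
MathCoder-CL | Code-Llama-7B | - | - | 67.8 | 30.2 | - | -
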
Stage II: Process-Level Reasoning Evaluation

Process-oriented evaluation across Hard2Verify, DeltaBench, ProcessBench, and PRMBench.

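Model / Method | Base Model | Hard2Verify Step A | Hard2Verify Step F1 | Hard2Verify Resp. A | Hard2Verify Resp. F1 | Hard2Verify ErrID A | Hard2Verify ErrID F1 | DeltaBench Avg. | DeltaBench HM | DeltaBench Corr. | DeltaBench Err. | ProcessBench GSM8K | ProcessBench MATH | ProcessBench Olympiad | ProcessBench Omni | PRMBench Overall | PRMBench Simp. | PRMBench Sound. | PRMBench Sens.
GPT-5 | - | 86.53 | 85.83 | 89.69 | 89.52 | 70.61 | 69.72 | - | - | - | - | - | - | - | - | - | - | - | -
Gemini 2.5 Pro | - | 83.37 | 83.09 | 85.73 | 85.46 | 52.46 | 52.46 | - | - | - | - | - | - | - | - | - | - | - | -
Claude Sonnet 4 | - | 70.61 | 60.37 | 78.24 | 73.44 | 53.45 | 39.30 | - | - | - | - | - | - | - | - | - | - | - | -
DeepSeek-R1 | - | 68.92 | 62.30 | 73.95 | 72.75 | 54.23 | 45.35 | - | - | - | - | - | - | - | - | 67.8 | 62.9 | 71.4 | 77.1
Qwen3-235B-A22B | - | 72.55 | 64.03 | 79.42 | 77.87 | 60.90 | 50.78 | - | - | - | - | - | - | - | - | - | - | - | -
Qwen3-Next-80B-A3B | - | 67.91 | 54.69 | 75.08 | 68.31 | 58.29 | 43.05 | - | - | - | - | - | - | - | - | - | - | - | -
GPT-4o | - | - | - | - | - | - | - | 49.9 | 48.7 | 42.0 | 57.9 | 79.2 | 63.6 | 51.4 | 53.5 | 66.8 | 59.7 | 70.9 | 75.8
o1-mini | - | - | - | - | - | - | - | - | - | - | - | 93.2 | 88.9 | 87.2 | 82.4 | 68.8 | 64.6 | 72.1 | 75.5
Gemini-2.0-thinking-exp-1219 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 68.8 | 66.2 | 71.8 | 75.3
QwQ-32B-Preview | - | - | - | - | - | - | - | - | - | - | - | 88.0 | 78.7 | 57.8 | 61.3 | 63.6 | 56.4 | 68.2 | 73.5
Llama-3.3-70B-Instruct | - | 54.28 | 18.37 | 57.04 | 28.16 | 49.44 | 2.50 | - | - | - | - | 82.9 | 59.4 | 46.7 | 43.0 | - | - | - | -
Qwen2.5-72B-Instruct | - | 56.01 | 26.36 | 61.06 | 46.89 | 26.49 | 16.38 | - | - | - | - | 76.2 | 61.8 | 54.6 | 52.2 | - | - | - | -
Qwen2.5-14B-Instruct | - | 60.45 | 47.59 | 63.40 | 63.23 | 43.47 | 18.86 | - | - | - | - | 69.3 | 53.3 | 45.0 | 41.3 | - | - | - | -
Qwen2.5-Math-72B-Instruct | - | - | - | - | - | - | - | - | - | - | - | 65.8 | 52.1 | 32.5 | 31.7 | 57.4 | 55.1 | 61.1 | 67.1
Qwen2.5-Math-PRM-72B | Qwen2.5-Math-72B | 55.82 | 35.50 | 66.80 | 64.91 | 41.80 | 37.28 | - | - | - | - | 87.3 | 80.6 | 74.3 | 71.1 | 68.2 | 54.6 | 73.9 | 77.0
Qwen2.5-Math-PRM-7B | Qwen2.5-Math-7B | 57.56 | 42.37 | 63.08 | 57.57 | 35.03 | 32.50 | - | - | - | - | 82.4 | 77.6 | 67.5 | 66.3 | 65.5 | 52.1 | 71.0 | 75.5
UniversalPRM-7B | Qwen2.5-Math-7B-Instruct | 64.17 | 60.27 | 54.74 | 41.46 | 26.08 | 25.97 | - | - | - | - | 85.8 | 77.7 | 67.6 | 66.4 | - | - | - | -
ActPRM-X | Qwen2.5-Math-PRM-7B | - | - | - | - | - | - | - | - | - | - | 82.7 | 82.0 | 72.0 | 67.3 | 66.7 | 54.5 | 72.7 | 75.6
ActPRM | Qwen2.5-Math-7B | - | - | - | - | - | - | - | - | - | - | 81.6 | 79.8 | 71.4 | 67.0 | 65.5 | 53.6 | 71.3 | 75.2
RefCritic-R1-14B | DeepSeek-R1-Distill-Qwen-14B | - | - | - | - | - | - | - | - | - | - | 86.3 | 82.0 | 67.6 | 72.3 | - | - | - | -
RefCritic-Qwen-14B | Qwen2.5-14B-Instruct | - | - | - | - | - | - | - | - | - | - | 81.9 | 71.2 | 58.1 | 60.7 | - | - | - | -
FlexiVe (Think@64) | DeepSeek-R1-Distill-Qwen-14B | - | - | - | - | - | - | - | - | - | - | 88.1 | 90.1 | 86.7 | 80.4 | - | - | - | -
FlexiVe (Flex@128) | DeepSeek-R1-Distill-Qwen-14B | - | - | - | - | - | - | - | - | - | - | 83.0 | 85.0 | 80.0 | 75.2 | - | - | - | -
GenPRM-32B (Maj@8) | Qwen2.5-32B | - | - | - | - | - | - | - | - | - | - | 85.1 | 86.3 | 78.9 | 80.1 | - | - | - | -
GenPRM-7B (Maj@8) | Qwen2.5-7B | - | - | - | - | - | - | - | - | - | - | 81.0 | 85.7 | 78.4 | 76.8 | - | - | - | -
SPC (Round 2) | Qwen2.5-7B-Instruct | - | - | - | - | - | - | 60.5 | 59.5 | 68.2 | 52.8 | - | - | - | - | - | - | - | -
SPC (Round 1) | Qwen2.5-7B-Instruct | - | - | - | - | - | - | 58.8 | 57.3 | 68.4 | 49.3 | - | - | - | - | - | - | - | -
SPC (Round 0) | Qwen2.5-7B-Instruct | - | - | - | - | - | - | 54.9 | 53.5 | 45.9 | 64.0 | - | - | - | - | - | - | - | -
Qwen2.5-Math-7B-PRM800K | Qwen2.5-Math-7B | - | - | - | - | - | - | 58.5 | 41.3 | 90.1 | 26.8 | 68.2 | 62.6 | 50.7 | 44.3 | - | - | - | -
Pure-PRM-7B | Qwen2.5-Math-7B | - | - | - | - | - | - | - | - | - | - | 69.0 | 66.5 | 48.4 | 45.9 | 65.3 | 52.2 | 70.2 | 75.8
Skywork-PRM-7B | Qwen2.5-Math-7B | 38.52 | 34.12 | 56.77 | 29.81 | 11.56 | 8.36 | - | - | - | - | 70.8 | 53.6 | 22.9 | 21.0 | 65.1 | 59.6 | 68.5 | 73.3
Math-Shepherd-PRM-7B | Mistral-7B | - | - | - | - | - | - | 53.3 | 14.3 | 7.69 | 98.8 | 47.9 | 29.5 | 24.8 | 23.8 | 47.0 | 47.1 | 45.7 | 60.7
RLHFlow-PRM-Mistral-8B | Llama-3.1-8B | - | - | - | - | - | - | - | - | - | - | 50.4 | 33.4 | 13.8 | 15.8 | 54.4 | 46.7 | 57.5 | 68.5
ReasonEval-34B | CodeLlama-34B | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 60.5 | 51.5 | 63.0 | 73.1
ReasonFlux-PRM-7B | DeepSeek-R1-Distill-Qwen-7B | 53.09 | 22.40 | 55.89 | 53.82 | 42.48 | 28.71 | - | - | - | - | - | - | - | - | - | - | - | -
uPRM | Qwen2.5-Math-7B | - | - | - | - | - | - | - | - | - | - | 58.3 | 52.6 | 42.7 | 39.8 | - | - | - | -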

Notes. Hard2Verify reports balanced accuracy and F1 for step, response, and error-identification tasks; DeltaBench reports average, HM, correct-step recall, and error-step recall; ProcessBench and PRMBench report their original F1/PRMScore metrics.

Stage III: Task-Closure Evaluation

Interactive task success and pass-rate evaluation across SWE, terminal, OS, web, and browsing benchmarks.

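Model / Method | Base Model | SWE-V | Terminal 2.0 | OSWorld-V | WebArena-V | BrowseComp | MCP-Atlas
GPT-5.4 xHigh | - | - | 75.1 | 75.0 | 67.3 | 82.7 | 67.2
Claude Opus 4.6 Max | - | 80.8 | 65.4 | - | - | 83.7 | 73.8
Gemini 3.1 Pro High | - | 80.6 | 68.5 | - | - | 85.9 | 69.2
DeepSeek-V4-Pro Max | - | 80.6 | 67.9 | - | - | 83.4 | 73.6
Kimi-K2.6 Thinking | - | 80.2 | 66.7 | - | - | 83.2 | 66.6
GLM-5.1 Thinking | - | - | 63.5 | - | - | 79.3 | 71.8
UI-TARS-2 | UI-TARS-2 | 68.7 | 45.3* | - | - | 29.6* | -
OpenCUA-72B | Qwen2.5-VL-72B | - | - | 45.0 | - | - | -
SWE-Exp | Claude 4 Sonnet | 73.0 | - | - | - | - | -
Kimi-Dev | Qwen2.5-72B-Base | 60.4 | - | - | - | - | -
SWE-Master-32B-RL | Qwen2.5-Coder-32B | 61.4 | - | - | - | - | -
PDR+RTV | Gemini 3.1 Pro | 76.60 | 64.77 | - | - | - | -
TACT-GATE | Qwen3.5-27B | 73.3 | 36.0 | - | - | - | -
IHR+NLAH | GPT-5.4-mini | 73.0 | 53.9 | - | - | - | -
Polar RL (Pi) | Qwen3.5-4B | 40.4 | - | - | - | - | -
SA-SWE-32B | Qwen3-32B | 39.4 | 16.25 | - | - | 19.4 (+) | -
CODESKILL | Qwen3.5-35B-A3B | 66.0 | 34.12 | - | - | - | -
CodeScout-14B | Qwen3-Coder-30B-A3B | 46.00 | - | - | - | - | -
TACO | MiniMax-M2.5 | - | 44.16 | - | - | - | -
ComputerRL | GLM-4.1V-9B-Thinking | - | - | 48.0 | - | - | -
UltraCUA-32B-RL | UltraCUA-32B | - | - | 43.7 | - | - | -
OS-Symphony | GPT-5 | - | - | 65.8 | - | - | -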

Notes. SWE-V denotes SWE-bench Verified; OSWorld-V denotes OSWorld-Verified; WebArena-V denotes WebArena-Verified; Terminal 2.0 denotes Terminal-Bench v2.0.

Stage IV: Workspace and OpenClaw Evaluation

Workspace-level evaluation, OpenClaw metrics, and trajectory-level safety where lower attack success rate is better.

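Model / Method | Setting | Claw-Eval Gen. | Claw-Eval Multi | ClawBench SR | ClawsBench TSR | ClawsBench UAR | ClawsBench SCR | ATBench-Claw Acc. | ATBench-Claw F1 | ATBench-Claw Rec. | ClawSafety ASR ↓
Claude Opus 4.6 | OpenClaw on/on | 70.8 | 68.4 | - | 63 | 23 | 50 | - | - | - | -
Claude Sonnet 4.6 | OpenClaw / ClawSafety | 68.3 | 65.8 | 33.3 | 56 | 13 | 48 | - | - | - | 40.0
MiMo-V2.5-Pro | Claw-Eval | 64.0 | 63.2 | - | - | - | - | - | - | - | -
GLM-5.1 | Claw-Eval | 62.7 | 60.5 | - | - | - | - | - | - | - | -
Muse Spark | Claw-Eval | 62.7 | 68.4 | - | - | - | - | - | - | - | -
Kimi K2.6 | Claw-Eval | 61.5 | 65.8 | - | - | - | - | - | - | - | -
GPT-5.4 | OpenClaw on/on | 60.2 | 60.5 | 6.5 | 53 | 7 | 41 | - | - | - | -
DeepSeek V4 Pro | Claw-Eval | 58.4 | 65.8 | - | - | - | - | - | - | - | -
Qwen3.6 Plus | Claw-Eval | 57.1 | 65.8 | - | - | - | - | - | - | - | -
Qwen3.5-397B-A17B | AgentDoG prompt | 57.8 | 52.6 | - | - | - | - | 83.8 | 86.5 | 87.5 | -
GLM-5 | OpenClaw text-only / on/on | - | - | 24.2 | 60 | 23 | 48 | - | - | - | -
Gemini 3 Flash | ClawBench | - | - | 19.0 | - | - | - | - | - | - | -
Claude Haiku 4.5 | ClawBench | - | - | 18.3 | - | - | - | - | - | - | -
Gemini 3.1 Flash-Lite | OpenClaw on/on | - | - | 3.3 | 39 | 23 | 26 | - | - | - | -
Kimi K2.5 | OpenClaw / ClawSafety | 52.8 | 50.0 | 0.7 | - | - | - | - | - | - | 60.8
Gemini 3.1 Pro | OpenClaw on/on | 55.9 | 65.8 | - | 58 | 10 | 48 | - | - | - | -
Qwen3Guard-Gen-8B | Guard model | - | - | - | - | - | - | 52.1 | 36.3 | 23.1 | -
Llama-Guard-4-12B | Guard model | - | - | - | - | - | - | 74.4 | 73.4 | 60.0 | -
ShieldAgent | Guard model | - | - | - | - | - | - | 68.1 | 60.1 | 43.3 | -
Llama-3.3-70B-Instruct | AgentDoG prompt | - | - | - | - | - | - | 80.6 | 82.3 | 76.4 | -
AgentDoG-Qwen3-4B | AgentDoG | - | - | - | - | - | - | 87.2 | 89.6 | 92.9 | -
Gemini 2.5 Pro | OpenClaw | - | - | - | - | - | - | - | - | - | 55.0
DeepSeek V3 | OpenClaw | - | - | - | - | - | - | - | - | - | 67.5
GPT-5.1 | OpenClaw | - | - | - | - | - | - | - | - | - | 75.0
Claude Sonnet 4.6 + Nanobot | Nanobot scaffold | - | - | - | - | - | - | - | - | - | 48.6
Claude Sonnet 4.6 + NemoClaw | NemoClaw scaffold | - | - | - | - | - | - | - | - | - | 45.8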

Notes. Claw-Eval reports general and multi-turn PassAll3; ClawsBench reports task success, unsafe action, and safe completion rates; ClawSafety reports attack success rate, where lower is better.

Future

Challenges and Future Directions

Reliable autonomy introduces failures that are longer-horizon, stateful, harder to reverse, and tied to real workspace changes. Future systems need stronger closure, safer permissions, robust memory, and governance.

Open challenges for reliable autonomy.
Open challenges for reliable autonomy: task closure, safety and governance, memory, context management, and persistent workspace state.
Future directions toward self-evolving AI ecosystems.
Future directions toward self-evolving AI ecosystems that accumulate experience and improve their operating environments.

Open Challenges: Making Autonomy Reliable

Current agents can look impressive in isolated demonstrations, but dependable digital work requires stable performance across long horizons, safe operation under permission constraints, and persistent memory that does not collapse as context grows. The challenge is not only stronger capability, but autonomous behavior that is auditable, recoverable, and controllable.

Long-Horizon Task Closure

Agents must maintain progress until the intended workspace state is achieved. Tool errors, partial failures, and early planning mistakes can compound across many actions, so future systems need progress monitoring, intermediate verification, self-healing, and rollback to safe checkpoints.
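
The checkpoint-and-rollback idea can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: the workspace is an in-memory dictionary, `verify` is a hypothetical intermediate check, and a failed step is skipped rather than retried or escalated.

```python
import copy
from typing import Callable, Dict, List


def run_with_rollback(state: Dict,
                      steps: List[Callable[[Dict], None]],
                      verify: Callable[[Dict], bool]) -> Dict:
    """Apply steps in order, verifying after each one.

    A step that raises or fails verification is discarded, so the returned
    value is always the last state that passed verification.
    """
    checkpoint = copy.deepcopy(state)
    for step in steps:
        working = copy.deepcopy(checkpoint)   # never mutate the safe checkpoint
        try:
            step(working)
        except Exception:
            continue                          # tool error: drop the partial change
        if verify(working):
            checkpoint = working              # verification passed: advance
    return checkpoint


def demo() -> Dict:
    def failing_tool(state: Dict) -> None:
        state["files"].append("partial.tmp")  # partial change before the crash
        raise RuntimeError("tool error")

    steps = [
        lambda s: s["files"].append("report.md"),
        failing_tool,
        lambda s: s["files"].append("secrets.env"),   # rejected by verify
        lambda s: s["files"].append("summary.md"),
    ]
    no_secrets = lambda s: "secrets.env" not in s["files"]
    return run_with_rollback({"files": []}, steps, no_secrets)


if __name__ == "__main__":
    print(demo())
```

Because every step runs on a copy, neither the crash nor the rejected change contaminates later steps, which is what keeps early mistakes from compounding. A real workspace would need filesystem or repository snapshots rather than `copy.deepcopy`.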

Safety and Governance

Once agents can touch files, browsers, APIs, terminals, databases, and enterprise applications, safety becomes operational. Fine-grained permissions, risk-aware action validation, audit trails, sandboxing, and calibrated human approval are required for high-impact actions.
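
A permission gate with an audit trail might look like the sketch below. It is a minimal illustration, not the design of any guardrail system named in this survey: the `ActionGate` class, the action names, and the `approve` callback (a stand-in for a human approval step) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class ActionGate:
    """Checks each action against explicit permissions and records the decision."""
    allowed: Set[str]                      # actions that may run unattended
    needs_approval: Set[str]               # high-impact actions that require a human
    approve: Callable[[str, str], bool]    # hypothetical human-approval callback
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def request(self, action: str, target: str) -> bool:
        if action in self.needs_approval:
            decision = "approved" if self.approve(action, target) else "denied"
        elif action in self.allowed:
            decision = "allowed"
        else:
            decision = "blocked"           # default-deny anything unlisted
        self.audit_log.append((action, target, decision))
        return decision in ("allowed", "approved")


if __name__ == "__main__":
    gate = ActionGate(
        allowed={"read_file"},
        needs_approval={"delete_file"},
        approve=lambda action, target: target.startswith("/tmp/"),
    )
    gate.request("read_file", "/etc/hosts")
    gate.request("delete_file", "/tmp/cache.bin")
    gate.request("delete_file", "/home/user/thesis.tex")
    gate.request("send_email", "all@example.com")
    for entry in gate.audit_log:
        print(entry)
```

Every request is logged whether or not it runs, so the audit trail also records what the agent attempted, and unlisted actions are refused by default rather than permitted by omission.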

Memory and Persistent State

Long-running work needs more than a transient context window. Agents need working memory for the current trajectory, episodic memory for past interactions, procedural memory for reusable skills, and external state stores that preserve decision-critical facts while discarding noise.

Future Directions: Toward Self-Evolving AI Ecosystems

The next stage is likely to be defined not only by larger foundation models, but by the ecosystems around them. Capability becomes distributed across prompts, context pipelines, tools, memories, workspaces, evaluators, skills, policies, and governance layers. In this view, operational experience becomes the material from which systems improve.

Prompt, Context, and Harness Engineering

Prompts specify intent, context supplies task-relevant state, and harnesses define the operational world where actions have consequences. Future progress depends on execution substrates that make actions inspectable, constrained, replayable, and learnable.

AI-Native Workspaces

A workspace gives the model a digital body: files, terminals, browsers, repositories, logs, snapshots, permissions, provenance, replay, rollback, and final-state diffs. These primitives make both safety and learning possible.

Beyond-Gradient Learning

Improvement can happen outside model weights. A failed tool call can become a better wrapper, a repeated correction can become memory, a successful trajectory can become a skill, and a benchmark failure can become a regression test.

Composable Skills and Multi-Agent Work

Skills need interfaces, versions, dependency contracts, tests, documentation, security review, and deprecation. Multi-agent systems also need shared state, ownership boundaries, escalation protocols, and final-state verifiers.

Toward Self-Evolving AI Ecosystems

A governed self-evolving loop turns operational traces into durable assets: successful trajectories become reusable skills, failures become regression tests, user corrections become memories, tool errors become wrapper updates, and safety incidents become policies. Every consequential action should become evidence, and every durable update should be validated, versioned, auditable, reversible, and deployed under explicit permission boundaries.

Resources


Access the paper, code, citation, project page, and contact channel for Next-Generation AI Systems.

Citation

Use the BibTeX below to cite this work.

@misc{zhang2026chatbotdigitalcolleagueparadigm,
      title={From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI}, 
      author={Yongheng Zhang and Ziang Liu and Jiaxuan Zhu and Shuai Wang and Xiangqi Chen and Haojing Huang and Jiayi Kuang and Siyu Chen and Ao Shen and Hao Wu and Qiufeng Wang and Qian-Wen Zhang and Junnan Dong and Wenhao Jiang and Ying Shen and Hai-Tao Zheng and Yinghui Li and Di Yin and Xing Sun and Philip S. Yu},
      year={2026},
      eprint={2606.14502},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.14502}, 
}