Next-Generation AI Systems

Overview

Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems for reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize it along two tightly coupled dimensions. First, at the cognitive-core level, LLMs evolve from Chatbot-era "fast thinking" systems driven by next-token prediction to Thinking LLMs that use inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning for deliberate cognition. Second, at the tool-augmented task-execution level, LLMs progress from fragile Agents that invoke tools to OpenClaw-style workstation systems (hereafter OpenClaw) with persistent Workspaces, skills, verification loops, and governance. The proposed "Workspace + Skill" paradigm turns episodic tool use into colleague-like work, enabling state persistence, reusable procedures, task closure, and experience reuse. Finally, we examine how data construction shifts from instruction-response pairs to State-Action-Observation trajectories, and how evaluation shifts from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

Key Contributions

01 Two-Dimensional View

Cognitive-core evolution and tool-augmented task execution organize the field's progress.

02 Digital Colleague Shift

Workspace + Skill explains how chatbot-style interaction becomes persistent digital work.

03 Task Closure

State, reusable procedures, verification, recoverability, and governance become system-level requirements.

04 Auditable AI Ecosystems

Data and evaluation move toward trajectories, final-state evidence, and sandboxed task completion.

Roadmap

Roadmap of Next-Generation AI Systems

The paper organizes AI systems across cognitive-core evolution and tool-augmented task execution: from chatbots, to thinking LLMs, to agents, and then to persistent workspace systems.

Figure: Roadmap and evolutionary timeline of next-generation AI systems, showing how AI systems progress from chatbots to reasoning cores, tool-using agents, and persistent workspace systems.

Figure: Time horizon growth of frontier AI agents, showing the movement from short response generation toward longer, more complex, and sustained task completion.

Cognitive Core

From Fast Response to Deliberate Reasoning

The first axis concerns how the model itself thinks. Chatbot-era systems are fast, stateless generators. Thinking LLMs introduce slower cognition through inference-time computation, long traces, reflection, process supervision, and reinforcement learning.

Chatbot Era

In the Chatbot Era, a user asks a natural-language question, the model performs fast single-pass inference over compressed parametric knowledge, and the system returns a fluent response.

Thinking LLM Era

Thinking LLMs allocate additional computation at inference time. They explore alternatives, verify intermediate steps, and revise before returning an answer.
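One simple way to spend extra inference-time computation is to sample several candidate answers and keep the most frequent one, in the spirit of self-consistency voting. The sketch below is illustrative only: `make_stub_sampler` is a hypothetical stand-in for a stochastic model call, and real Thinking LLMs use far richer mechanisms than a majority vote.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    """Draw n candidate answers and return the most common one."""
    answers: List[str] = [sample(question) for _ in range(n)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler(outputs):
    """Hypothetical sampler that replays a fixed list of model outputs."""
    it = iter(outputs)
    return lambda question: next(it)


sampler = make_stub_sampler(["42", "41", "42", "42", "40"])
print(self_consistency(sampler, "What is 6 * 7?"))  # prints 42
```

The cost grows linearly with `n`, which is the trade this era makes: more computation per query in exchange for a more reliable answer.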

Representative-model tables summarize the transition from non-reasoning chatbot-era LLMs to reasoning-oriented Thinking LLMs.

Representative Chatbot-Era LLMs

Model	Release	Params	Type	Access	Model	Release	Params	Type	Access
GPT-1	2018-06	117M	Text	Open	Orca	2023-06	13B	Text	Closed
GPT-2	2019-02	1.5B	Text	Open	Llama 2	2023-07	7–70B	Text	Open
PLATO	2019-10	132M	Text	Open	InternLM	2023-07	7B/20B	Text	Open
T5	2019-10	60M–11B	Text	Open	Claude 2	2023-07	-	Text	Closed
DialoGPT	2019-11	117M/345M/762M	Text	Open	WizardLM	2023-07	7B/13B/30B	Text	Open
Meena	2020-01	2.6B	Text	Closed	Qwen	2023-08	1.8–72B	Text	Open
BlenderBot	2020-04	90M/2.7B/9.4B	Text	Open	Qwen-VL	2023-08	9.6B	Multi	Open
GPT-3	2020-05	175B	Text	Closed	OpenFlamingo	2023-08	3–9B	Multi	Open
PLATO-2	2020-06	93M/314M/1.6B	Text	Open	Code Llama-Instruct	2023-08	7B/13B/34B	Code	Open
BlenderBot 2	2021-07	400M/2.7B	Text	Open	WizardMath	2023-08	7B/13B/70B	Text	Open
Jurassic-1	2021-08	178B	Text	Closed	WizardCoder	2023-08	7B/13B/34B	Code	Open
Codex	2021-08	12M-12B	Text	Closed	IDEFICS	2023-08	9B/80B	Multi	Open
HyperCLOVA	2021-09	82B	Text	Closed	Phi-1.5	2023-09	1.3B	Text	Open
PLATO-XL	2021-09	11B	Text	Open	Baichuan 2	2023-09	7B/13B	Text	Open
Gopher	2021-12	280B	Text	Closed	GPT-4V	2023-09	-	Multi	Closed
ERNIE 3.0 Titan	2021-12	260B	Text	Closed	Mistral 7B	2023-09	7.3B	Text	Open
GLaM	2021-12	1.2T-A97B	Text	Closed	Mixtral	2023-09	7B	Text	Open
LaMDA	2022-01	137B	Text	Closed	Kimi / Moonshot	2023-10	-	Text	Closed
AlphaCode	2022-02	9B/41B	Text	Closed	ERNIE 4.0	2023-10	-	Multi	Closed
InstructGPT	2022-03	1.3–175B	Text	Closed	Fuyu	2023-10	8B	Multi	Open
Chinchilla	2022-03	70B	Text	Closed	Zephyr-7B	2023-10	7B	Text	Open
CodeGen	2022-03	350M/2B/6B/16B	Text	Open	ChatGLM3-6B	2023-10	6B	Text	Open
PaLM	2022-04	540B	Text	Closed	Skywork-13B	2023-10	13B	Text	Open
Flamingo	2022-04	80B	Multi	Closed	GPT-4 Turbo	2023-11	-	Text	Closed
OPT	2022-05	125M–175B	Text	Open	Grok-1	2023-11	314B-A78.5B	Text	Open
GODEL	2022-06	220M/770M	Text	Open	Yi	2023-11	6B/34B	Text	Open
BLOOM	2022-07	176B	Text	Open	CogVLM	2023-11	17B	Multi	Open
BlenderBot 3	2022-08	3B/30B/175B	Text	Open	Claude 2.1	2023-11	-	Text	Closed
PaLI / PaLI-X	2022-09	17B/55B	Multi	Closed	Inflection-2	2023-11	-	Text	Closed
Sparrow	2022-09	70B	Text	Closed	DeepSeek Coder Instruct	2023-11	1B–33B	Code	Open
CodeGeeX	2022-09	13B	Text	Open	OpenChat 3.5	2023-11	7B	Text	Open
GLM-130B	2022-10	130B	Text	Open	DeepSeek LLM	2023-11	7B/67B	Text	Open
Galactica	2022-11	120B	Text	Open	Orca 2	2023-11	7B/13B	Text	Open
BLIP-2	2023-01	4B-12B	Multi	Open	Mixtral 8x7B	2023-12	47B-A13B	Text	Open
Llama	2023-02	7–65B	Text	Open	Phi-2	2023-12	2.7B	Text	Open
Alpaca	2023-03	7B	Text	Open	Gemini 1.0	2023-12	-	Multi	Closed
Claude 1	2023-03	-	Text	Closed	InternVL 1.0	2023-12	6B+	Multi	Open
PanGu-Sigma	2023-03	1.085T	Text	Closed	SOLAR-10.7B	2023-12	10.7B	Text	Open
BloombergGPT	2023-03	50B	Text	Closed	GLM-4	2024-01	-	Text	Closed
ChatGLM-6B	2023-03	~6.2B	Text	Open	GLM-4V	2024-01	9B	Multi	Closed
GPT-4	2023-03	-	Multi	Closed	LLaVA-NeXT	2024-01	7–34B	Multi	Open
PaLM-E	2023-03	562B	Multi	Closed	Stable LM 2	2024-01	1.6B	Text	Open
Vicuna	2023-03	7B/13B	Text	Open	Yi-VL	2024-01	6B/34B	Multi	Open
GPT-3.5 Turbo	2023-03	-	Text	Closed	Mistral Large	2024-02	-	Text	Closed
Pythia	2023-04	70M–12B	Text	Open	Qwen 1.5	2024-02	0.5–72B	Text	Open
LLaVA	2023-04	7B/13B	Multi	Open	Gemini 1.5	2024-02	-	Multi	Closed
MiniGPT-4	2023-04	7B/13B	Multi	Open	OLMo	2024-02	1B/7B	Text	Open
Dolly 2.0	2023-04	12B	Text	Open	StarCoder2	2024-02	3B/7B/15B	Code	Open
Stable LM	2023-04	3B/7B	Text	Open	Reka Flash	2024-02	21B	Multi	Closed
Falcon	2023-05	7–180B	Text	Open	Gemma	2024-03	2B/7B	Text	Open
MPT	2023-05	7B/30B	Text	Open	Qwen1.5-MoE	2024-03	14B-A2.7B	Text	Open
StarCoder	2023-05	15.5B	Text	Open	DBRX	2024-03	132B-A36B	Text	Open
RedPajama	2023-05	3B/7B	Text	Open	Jamba	2024-03	52B-A12B	Text	Open
InstructBLIP	2023-05	-	Multi	Open	Claude 3	2024-03	-	Multi	Closed
PaLM 2	2023-05	-	Text	Closed	Command R	2024-03	35B	Text	Open
CodeT5+	2023-05	220M-16B	Code	Open	Inflection-2.5	2024-03	-	Text	Closed
Inflection-1	2023-06	-	Text	Closed	DeepSeek-VL	2024-03	1.3B/7B	Multi	Open
Phi-1	2023-06	1.3B	Text	Open	Grok-1.5	2024-03	-	Text	Closed
Aquila	2023-06	7B/33B	Text	Open	MM1	2024-03	3B/7B/30B	Multi	Closed
ChatGLM2-6B	2023-06	6B	Text	Open	Yi-9B	2024-03	9B	Text	Open
Baichuan-Chat	2023-06	7B/13B	Text	Open	Phi-3	2024-04	3.8–14B	Text	Open
XGen-7B	2023-06	7B	Text	Open	Mixtral 8x22B	2024-04	141B-A39B	Text	Open
Llama 3	2024-04	8B/70B	Text	Open	Llama-3.1-Nemotron-70B	2024-11	70B	Text	Open
Command R+	2024-04	104B	Text	Open	Hunyuan-Large	2024-11	389B-A52B	Text	Open
InternVL 1.5	2024-04	26B	Multi	Open	OLMo 2	2024-11	7B/13B	Text	Open
Reka Core	2024-04	-	Multi	Closed	Pixtral Large	2024-11	124B	Multi	Mixed
CodeQwen1.5	2024-04	7B	Code	Open	SmolVLM	2024-11	2B	Multi	Open
IDEFICS2	2024-04	8B	Multi	Open	DeepSeek-V3	2024-12	671B-A37B	Text	Open
OpenELM	2024-04	270M-3B	Text	Open	Llama 3.3	2024-12	70B	Text	Open
Snowflake Arctic	2024-04	480B-A17B	Text	Open	PaliGemma 2	2024-12	3B/10B/28B	Multi	Open
Doubao	2024-05	-	Text	Closed	DeepSeek-VL2	2024-12	27B-A4.5B	Multi	Open
DeepSeek-V2	2024-05	236B-A21B	Text	Open	Falcon 3	2024-12	1B-10B	Text	Open
GPT-4o	2024-05	-	Multi	Closed	Granite 3.1	2024-12	1B-8B	Text	Open
CogVLM2	2024-05	19B	Multi	Open	InternVL2.5	2024-12	1B–78B	Multi	Open
MiniCPM-V	2024-05	2–8B	Multi	Open	MiniMax-Text-01	2025-01	456B-A45.9B	Text	Open
Codestral	2024-05	22B	Code	Open	MiniMax-VL-01	2025-01	456B-A45.9B	Multi	Open
Falcon 2	2024-05	11B	Multi	Open	Qwen2.5-Max	2025-01	-	Text	Closed
PaliGemma	2024-05	3B	Multi	Open	MiniCPM-o 2.6	2025-01	8B	Multi	Open
Aya 23	2024-05	8B/35B	Text	Open	Qwen2.5-VL	2025-01	3B/7B/72B	Multi	Open
Granite Code	2024-05	3B-34B	Code	Open	Janus-Pro	2025-01	1B/7B	Multi	Open
Qwen 2	2024-06	0.5–72B	Text	Open	Mistral Small 3	2025-01	24B	Text	Open
GLM-4-9B	2024-06	9B	Text	Open	GPT-4.5	2025-02	-	Text	Closed
Claude 3.5 Sonnet	2024-06	-	Multi	Closed	Phi-4-mini	2025-02	4B	Text	Open
Cambrian-1	2024-06	3–34B	Multi	Open	Phi-4-multimodal	2025-02	6B	Multi	Open
DeepSeek-Coder-V2	2024-06	236B-A21B	Code	Open	Command A	2025-03	111B	Text	Closed
Nemotron-4	2024-06	340B	Text	Open	Mistral Small 3.1	2025-03	24B	Multi	Open
Gemma 2	2024-06	2B/9B/27B	Text	Open	Aya Vision	2025-03	8B/32B	Multi	Open
Skywork-MoE	2024-06	146B-A22B	Text	Open	Qwen2.5-VL-32B	2025-03	32B	Multi	Open
InternVL 2.0	2024-07	1–76B	Multi	Open	OLMo 2 32B	2025-03	32B	Text	Open
Llama 3.1	2024-07	8–405B	Text	Open	GPT-4.1	2025-04	-	Multi	Closed
InternLM 2.5	2024-07	1.8–20B	Text	Open	GPT-4.1 mini	2025-04	-	Multi	Closed
GPT-4o mini	2024-07	-	Multi	Closed	GPT-4.1 nano	2025-04	-	Multi	Closed
Codestral Mamba	2024-07	7B	Code	Open	Granite 3.3	2025-04	2B/8B	Text	Open
Mistral NeMo	2024-07	12B	Text	Open	Kimi-VL-A3B-Instruct	2025-04	16B-A2.8B	Multi	Open
SmolLM	2024-07	135M/360M/1.7B	Text	Open	Amazon Nova Premier	2025-04	-	Multi	Closed
Mistral Large 2	2024-07	123B	Text	Open	Mistral Medium 3	2025-05	-	Multi	Closed
LLaVA-OV	2024-08	0.5–72B	Multi	Open	Devstral	2025-05	24B	Code	Open
Grok-2	2024-08	-	Text	Closed	ERNIE-4.5-300B-A47B	2025-06	300B-A47B	Multi	Open
Grok-1.5V	2024-08	-	Multi	Closed	Qwen3-4B-Instruct	2025-07	4B	Text	Open
Phi-3.5-mini-instruct	2024-08	3.8B	Text	Open	Kimi K2 Instruct	2025-07	1T-A32B	Text	Open
Phi-3.5-MoE-instruct	2024-08	42B-A6.6B	Multi	Open	Qwen3-Coder	2025-07	480B-A35B	Text	Open
Jamba 1.5	2024-08	398B-A94B	Text	Open	FastVLM	2025-07	0.5B/1.5B/7B	Multi	Open
Qwen2-VL	2024-09	2–72B	Multi	Open	LFM2-VL	2025-08	450M/1.6B/3B	Multi	Open
Llama 3.2 Text	2024-09	1B/3B	Text	Open	LongCat-Flash-Chat	2025-08	560B-A27B	Multi	Open
Llama 3.2 Vision	2024-09	11B/90B	Multi	Open	Qwen3-Next	2025-09	80B-A3B	Text	Open
Qwen2.5	2024-09	0.5–72B	Text	Open	Qwen3-VL	2025-09	235B-A22B	Multi	Open
Pixtral	2024-09	12B	Multi	Open	Mistral Large 3	2025-12	675B-A41B	Multi	Open
OLMoE	2024-09	7B-A1B	Text	Open	Ministral 3 Instruct	2025-12	3B/8B/14B	Multi	Open
Molmo	2024-09	7B/72B	Multi	Open	Devstral 2	2025-12	123B	Text	Open
Claude 3.5 Haiku	2024-10	-	Text	Closed	Devstral Small 2	2025-12	24B	Text	Open
Aya Expanse	2024-10	8B/32B	Text	Open	Qwen3-Coder-Next	2026-02	80B-A3B	Text	Open
Granite 3.0	2024-10	1B-8B	Text	Open	LongCat-Flash-Lite	2026-02	68.5B-A3B	Text	Open
Yi-Lightning	2024-10	-	Text	Closed	Mistral Small 4-instruct	2026-03	119B-A6B	Multi	Open
Qwen2.5-Coder	2024-11	0.5–32B	Text	Open	LongCat-Next	2026-03	68.5B-A3B	Multi	Open

Representative Thinking LLMs

Model	Release	Params	Type	Access	Model	Release	Params	Type	Access
o1-preview	2024-09	-	Text	Closed	Claude 4 Opus	2025-05	-	Multi	Closed
o1-mini	2024-09	-	Text	Closed	MiniMax-M1	2025-06	456B/46B	Text	Open
Marco-o1	2024-11	7B	Text	Open	Kimi-Dev-72B	2025-06	72B	Code	Open
QwQ-32B-Preview	2024-11	32B	Text	Open	MiMo-VL-7B	2025-06	7B	Multi	Open
Skywork-o1 Open	2024-11	8B	Text	Open	Hunyuan-A13B-Instruct	2025-06	80B-A13B	Text	Open
o1	2024-12	-	Text	Closed	Kimi K2	2025-07	1T/32B	Multi	Open
o1-pro	2024-12	-	Text	Closed	Qwen3-Coder	2025-07	480B/35B	Code	Open
Gemini 2.0 Flash Thinking	2024-12	-	Multi	Closed	Qwen3-235B-Thinking-2507	2025-07	235B/22B	Text	Open
QVQ-72B-Preview	2024-12	72B	Multi	Open	Grok 4	2025-07	-	Multi	Closed
DeepSeek-R1-Zero	2025-01	671B/37B	Text	Open	SmolLM3	2025-07	3B	Text	Open
DeepSeek-R1	2025-01	671B/37B	Text	Open	GPT-5	2025-08	-	Multi	Closed
R1-Distill-Qwen	2025-01	1.5B–32B	Text	Open	DeepSeek-V3.1	2025-08	685B/37B	Text	Open
R1-Distill-Llama	2025-01	8B/70B	Text	Open	GPT-oss-120B	2025-08	117B/5.1B	Text	Open
Kimi k1.5	2025-01	-	Multi	Closed	GPT-oss-20B	2025-08	20B	Text	Open
Sky-T1-32B	2025-01	32B	Text	Open	Claude Opus 4.1	2025-08	-	Multi	Closed
o3-mini	2025-01	-	Text	Closed	ERNIE 4.5-Thinking	2025-09	21B/3B	Text	Open
s1	2025-02	32B	Text	Open	Claude Sonnet 4.5	2025-09	-	Multi	Closed
LIMO	2025-02	32B	Text	Open	Grok 4 Fast	2025-09	-	Multi	Closed
Grok 3	2025-02	-	Multi	Closed	MiniMax-M2	2025-10	230B/10B	Multi	Open
Grok 3 mini	2025-02	-	Text	Closed	Claude Haiku 4.5	2025-10	-	Multi	Closed
Claude 3.7 Sonnet	2025-02	-	Multi	Closed	Grok 4.1 Fast	2025-10	-	Multi	Closed
Hunyuan-T1-Preview	2025-02	-	Text	Closed	Ring-1T	2025-10	1T-A50B	Text	Open
Open-Reasoner-Zero	2025-02	7B/32B	Text	Open	GPT-5.1	2025-11	-	Multi	Closed
TinyZero	2025-02	3B	Text	Open	Gemini 3 Pro	2025-11	-	Multi	Closed
Eurus-2-PRIME	2025-02	7B	Text	Open	Grok 4.1	2025-11	-	Multi	Closed
Bespoke-Stratos	2025-02	7B	Text	Open	Claude Opus 4.5	2025-11	-	Multi	Closed
Light-R1	2025-02	7B/14B	Text	Open	DeepSeek-V3.2	2025-12	671B/37B	Text	Open
Hunyuan-TurboS	2025-02	560B-A56B	Text	Closed	DeepSeek-V3.2-Speciale	2025-12	671B/37B	Text	Open
Gemma 3	2025-03	4B/12B/27B	Multi	Open	Gemini 3 Flash	2025-12	-	Multi	Closed
QwQ-32B	2025-03	32B	Text	Open	MiMo-V2-Flash	2025-12	309B/15B	Text	Open
Hunyuan-T1	2025-03	-	Text	Closed	GLM-4.7	2025-12	358B	Text	Open
Gemini 2.5 Pro	2025-03	-	Multi	Closed	Devstral 2	2025-12	123B	Code	Open
DeepSeek-V3-0324	2025-03	671B/37B	Text	Open	GPT-5.2	2025-12	-	Multi	Closed
Phi-4-reasoning	2025-04	14B	Text	Open	LongCat-Flash-Thinking-2601	2026-01	560B-A27B	Text	Open
Phi-4-reasoning-plus	2025-04	14B	Text	Open	Step 3.5 Flash	2026-02	-	Text	Open
Qwen3	2025-04	0.6B–235B	Text	Open	Kimi K2.5	2026-02	1T/32B	Multi	Open
o3	2025-04	-	Multi	Closed	Qwen3.5	2026-02	397B/17B	Multi	Open
o4-mini	2025-04	-	Multi	Closed	Gemini 3.1 Pro	2026-02	-	Multi	Closed
Kimi-VL-A3B-Thinking	2025-04	16B-A2.8B	Multi	Open	GPT-5.3-Codex	2026-02	-	Code	Closed
GLM-Z1-32B	2025-04	32B	Text	Open	Claude Opus 4.6	2026-02	-	Multi	Closed
Z1-Rumination-32B	2025-04	32B	Text	Open	MiniMax-M2.5	2026-02	230B-A10B	Text	Open
GLM-Z1-9B	2025-04	9B	Text	Open	GPT-5.4	2026-03	-	Multi	Closed
Llama 4 Maverick	2025-04	400B/17B	Multi	Open	Nemotron-Cascade-2	2026-03	30B/3B	Code	Open
Llama 4 Scout	2025-04	109B/17B	Multi	Open	GPT-5.3	2026-03	-	Multi	Closed
Seed-Thinking-v1.5	2025-04	-	Text	Open	MiniMax-M2.7	2026-03	230B-A10B	Text	Open
Nemotron-Ultra-253B	2025-04	253B/17B	Text	Open	MiMo-V2.5-Pro	2026-04	-	Multi	Open
ERNIE-4.5-VL	2025-04	424B-A47B	Multi	Open	Kimi K2.6	2026-04	1T/32B	Multi	Open
Codex-1	2025-05	-	Code	Closed	GLM-5.1	2026-04	754B	Text	Open
DeepSeek-R1-0528	2025-05	671B/37B	Text	Open	DeepSeek-V4	2026-04	1.6T	Text	Open
R1-Distill-Qwen3-8B	2025-05	8B	Text	Open	Qwen3.6	2026-04	35B/3B+	Multi	Open
R1-Distill-Qwen3-32B	2025-05	32B	Text	Open	Gemma 4	2026-04	2B–26B	Multi	Open
MiMo-7B-RL	2025-05	7B	Text	Open	GPT-5.5	2026-04	-	Multi	Closed
MiMo-7B-RL-0530	2025-05	7B	Text	Open	Claude Opus 4.7	2026-04	-	Multi	Closed
Doubao 1.5 Pro Thinking	2025-05	-	Text	Closed	Claude Mythos Preview	2026-04	-	Multi	Closed
Gemini 2.5 Flash	2025-05	-	Multi	Closed	Grok 4.3	2026-05	-	Multi	Closed
InternVL3	2025-05	2B–78B	Multi	Open	Ring-2.6-1T	2026-05	1T-A63B	Text	Open
Devstral	2025-05	24B	Code	Open	ERNIE 5.1	2026-05	-	Multi	Closed
Claude 4 Sonnet	2025-05	-	Multi	Closed	Claude Opus 4.8	2026-05	-	Multi	Closed

Task Execution

From Agents to OpenClaw

The second axis asks how a stronger cognitive core becomes a system that can act. The Agent Era introduces environment-action-feedback loops. The OpenClaw Era embeds action inside persistent workspaces.

Agent Era

Agent systems break from single-turn answering by allowing LLMs to plan, call tools, browse, write code, manipulate files, and react to observations.
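The environment-action-feedback loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular framework's API: `scripted_policy` is a hypothetical stand-in for the LLM, and `calc` is a toy tool.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, tool argument)


def run_agent(policy: Callable[[List[str]], Action],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> List[str]:
    """Alternate between choosing an action and observing its result."""
    history: List[str] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "finish":
            history.append(f"final: {arg}")
            break
        observation = tools[tool](arg)
        history.append(f"{tool}({arg}) -> {observation}")
    return history


def scripted_policy(history: List[str]) -> Action:
    """Hypothetical policy: call the calculator once, then finish."""
    if not history:
        return ("calc", "2+3")
    return ("finish", history[-1].split(" -> ")[1])


def calc(expr: str) -> str:
    """Toy tool that adds two integers written as 'a+b'."""
    a, b = expr.split("+")
    return str(int(a) + int(b))


trace = run_agent(scripted_policy, {"calc": calc})
print(trace)  # prints ['calc(2+3) -> 5', 'final: 5']
```

Note that all state lives in `history`, which disappears when the loop returns. That episodic quality is what the workspace abstraction later replaces.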

OpenClaw Era

OpenClaw-style systems treat the workspace as the host of work, turning actions into inspectable operations over files, terminals, browsers, services, and skills.

Agent Era vs. OpenClaw Era

The boundary table clarifies why OpenClaw is not just another agent loop: the organizing abstraction changes from external tool interaction to persistent workspace-based task hosting.

Dimension	Agent Era	OpenClaw Era
Organizing abstraction	Environment-action-feedback loop	Persistent workspace for task hosting
Unit of action	Tool call or external API invocation	Workspace operation over files, terminals, browsers, services, and skills
State model	Episodic observations and short-horizon memory	Durable files, sessions, logs, repositories, local memory, and snapshots
Knowledge reuse	Prompt patterns, retrieved memory, or ad-hoc demonstrations	Reusable skill packages with instructions, scripts, dependencies, examples, and checks
Evaluation focus	Action correctness and trajectory success rate	Task closure, final-state verification, repeatability, and auditability
Safety boundary	Prompt-level guardrails and tool-use policies	Runtime permissions, provenance tracking, audit logs, and governance over workspace changes
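To make the contrast in the table concrete, the sketch below models a workspace operation rather than a bare tool call: the write lands in a durable directory, is gated by a runtime permission, and is recorded in an append-only audit log. The `Workspace` class and the `audit.jsonl` file name are hypothetical and do not reflect OpenClaw's actual interface.

```python
import json
import tempfile
from pathlib import Path


class Workspace:
    """Toy persistent workspace: files on disk plus an append-only audit log."""

    def __init__(self, root: Path, writable: set):
        self.root = root
        self.writable = writable  # file names the agent is permitted to write
        self.log = root / "audit.jsonl"

    def write(self, name: str, text: str) -> bool:
        allowed = name in self.writable
        if allowed:
            (self.root / name).write_text(text)
        # Record every attempt, allowed or not, so the run can be audited.
        with self.log.open("a") as fh:
            fh.write(json.dumps({"op": "write", "file": name, "allowed": allowed}) + "\n")
        return allowed

    def audit(self) -> list:
        return [json.loads(line) for line in self.log.read_text().splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    ws = Workspace(Path(tmp), writable={"notes.txt"})
    ws.write("notes.txt", "draft")
    ws.write("secrets.env", "TOKEN=x")  # denied, but still logged
    print(ws.audit())
```

Logging denied attempts alongside successful ones is a deliberate choice here: auditability covers what the agent tried to do, not only what it changed.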

Representative Agent Works

Work	Year	Category	Role	Key Contribution
ReAct	2022	Agent Architecture	Agent Framework	Thought-Action-Observation loop
HuggingGPT	2023	Perception	Agent System	LLM orchestration of Hugging Face models
Reflexion	2023	Planning	Agent Framework	Language reflections on failed attempts
Generative Agents	2023	Memory	Agent System	Memory stream with reflections
Toolformer	2023	Tool Use	Agent Model	Self-supervised tool-call learning
WebArena	2023	Benchmark	Evaluation	Realistic web task environments

Workspace + Skill

The Mechanism Behind Digital Colleagues

Workspace + Skill turns episodic interaction into durable digital work. A workspace provides state, evidence, recoverability, and consequences. A skill provides reusable operational knowledge.

Figure: Simple tool invocation. It helps with local sub-tasks, but isolated calls are not enough for long-horizon work.

Figure: The Workspace + Skill paradigm, which combines persistent environments with reusable procedures.
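A skill can be pictured as a small package that bundles instructions, an executable procedure, and a check on the resulting workspace state. The sketch below is an assumption-laden illustration: the `Skill` fields, `apply_skill`, and the in-memory dictionary standing in for a workspace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Skill:
    name: str
    instructions: str
    run: Callable[[Dict[str, str]], None]    # mutates the workspace state
    check: Callable[[Dict[str, str]], bool]  # verifies the resulting state


def apply_skill(skill: Skill, workspace: Dict[str, str]) -> bool:
    """Run a skill against the workspace, then evaluate its own check."""
    skill.run(workspace)
    return skill.check(workspace)


def _add_changelog(ws: Dict[str, str]) -> None:
    # setdefault leaves an existing file untouched.
    ws.setdefault("CHANGELOG.md", "# Changelog\n")


changelog_skill = Skill(
    name="init-changelog",
    instructions="Create CHANGELOG.md if the workspace lacks one.",
    run=_add_changelog,
    check=lambda ws: ws.get("CHANGELOG.md", "").startswith("# Changelog"),
)

workspace = {"README.md": "demo"}
print(apply_skill(changelog_skill, workspace), sorted(workspace))
# prints True ['CHANGELOG.md', 'README.md']
```

Because the check travels with the procedure, the same skill can be reused across tasks while still producing evidence that it worked in each one.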

Representative OpenClaw-Era Works

Work	Year	Category	Role	Key Contribution
OpenClaw	2026	Workspace	Agent Framework	Persistent workspace with tools, channels, and skills
OpenHands	2024	Workspace	Agent Platform	Code editing, shell execution, and browsing in controlled environments
SWE-agent	2024	Workspace	Agent-Computer Interface	Repository navigation and test-feedback interface for agents
Voyager	2023	Skill	Agent System	Executable skill library learned from environment feedback
Anthropic Agent Skills	2026	Skill	Skill Infrastructure	Folder-based skills with instructions, scripts, and resources
ClawGuard	2026	Governance	Runtime Guardrail	File, command, network, and skill boundary enforcement

Data & Eval

From Static Answers to Auditable Work

As AI systems move from answering questions to acting in workspaces, both data and evaluation must change. The unit of learning and measurement becomes the full state-action-observation trajectory and the verified final state.

Figure: Data paradigm shift. Data evolves from prompt-response pairs to reasoning traces and state-action-observation trajectories.

Figure: Evaluation paradigm shift. Evaluation shifts from final-answer correctness to process judgment and task closure.

Data Construction

Agent and OpenClaw data must capture tool outputs, UI states, file changes, terminal errors, permissions, workspace snapshots, skills, and final-state evidence.

Stage	Core Data Unit	Training / Supervision Signal	Representative Resources	Evaluation Focus
Chatbot	Static corpora and instruction–response pairs	Human demonstrations, preference comparisons, safety labels, and dialogue corrections	InstructGPT/RLHF, FLAN/T0, Self-Instruct and open SFT data	Answer correctness, fluency, helpfulness, preference alignment, and instruction following on mostly static inputs
Thinking LLM Era	Reasoning-process traces and intermediate solution paths	Chain-of-thought rationales, self-generated reasoning, step-wise verification, process rewards, and preference optimization	CoT / zero-shot CoT, Self-Consistency and ToT, PRM800K and Math-Shepherd, DeepSeek-R1	Reliability of the reasoning path, verifiable math/code performance, step-level correctness, and robustness beyond final-answer accuracy
Agent Era	State–action–observation trajectories with tool feedback	Tool-call traces, API arguments, execution results, environment feedback, and multi-step recovery signals	Toolformer, API-Bank / Gorilla / ToolBench, WebArena and OSWorld	Task success in interactive environments, correct tool selection, argument generation, state tracking, and feedback-driven continuation
OpenClaw / Workspace Era	Workspace-level trajectories plus reusable skills and final-state evidence	File, shell, browser, UI, permission, snapshot, skill-package, and safety-policy traces with executable verification	SWE-bench, ToolSandbox, ClawsBench, ATBench-Claw and ClawSafety	End-to-end task closure, state verifiability, reproducibility, efficiency, rollback behavior, and trajectory-level safety
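As a rough illustration of the data unit in the last two rows, a trajectory record might pair each action with the state it was taken in and the observation it produced, and close with final-state evidence. The field names below are hypothetical and are not taken from any of the listed datasets.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    state: Dict[str, Any]
    action: Dict[str, Any]
    observation: str


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    final_state: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict recurses into the nested Step dataclasses.
        return json.dumps(asdict(self), sort_keys=True)


traj = Trajectory(task="fix failing test")
traj.steps.append(Step({"tests_passing": False}, {"tool": "shell", "args": "pytest -q"}, "1 failed"))
traj.steps.append(Step({"tests_passing": False}, {"tool": "edit", "args": "utils.py"}, "file saved"))
traj.final_state = {"tests_passing": True}

record = json.loads(traj.to_json())
print(len(record["steps"]), record["final_state"])  # prints 2 {'tests_passing': True}
```

An instruction-response pair would keep only the task and a final answer; the trajectory keeps the intermediate evidence that supervision and evaluation at these stages depend on.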

Evaluation

Next-generation systems must be assessed by reasoning validity, state changes, reliability, efficiency, reproducibility, safety, and task closure.

Stage	Evaluation Object	Core Metrics	Representative Benchmarks	Main Limitation
Final-Output Evaluation	Static answers, labels, generated text, or executable final outputs	Accuracy, exact match, BLEU/ROUGE, preference win rate, and Pass@1	MMLU, GSM8K / MATH, GPQA / FrontierMath, BIG-Bench / HELM	Scores the endpoint but cannot reveal whether the model used a valid reasoning path or merely reached the right answer accidentally
Process-Level Evaluation	Reasoning traces, intermediate steps, critiques, and verification paths	Step correctness, judge preference, process-reward quality, consistency, and contamination resistance	Hard2Verify / DeltaBench, ProcessBench / PRMBench	Improves trace inspection but may rely on judge models, incomplete process labels, or reasoning that is not grounded in external state
Task-Closure Evaluation	Interactive trajectories and final workspace states after tool, web, file, or UI operations	Task success rate, final-state verification, tool-call efficiency, reliability, reproducibility, and trajectory-level safety	SWE-bench, WebArena, OSWorld, ToolSandbox / tau-bench	Requires executable environments, reproducible initial states, trajectory logs, replay mechanisms, and costly final-state checks
Workspace/OpenClaw Evaluation	Persistent workspaces with skills, permissions, snapshots, external services, and auditable action chains	Closure rate, unsafe-action rate, rollback behavior, skill stability, state diffs, auditability, and governance compliance	Claw-Eval, ClawBench, ClawsBench, ATBench-Claw, ClawSafety	Makes evaluation realistic but increases infrastructure cost, safety risk, simulator-design burden, and cross-run comparability challenges
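Final-state verification can be sketched as a comparison between the workspace an agent leaves behind and the state the task required, independent of what the agent reports. The functions and the dictionary-based workspace below are hypothetical simplifications; real benchmarks use executable checks over full environments.

```python
from typing import Dict, List


def verify_final_state(workspace: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return human-readable mismatches; an empty list means the task closed."""
    problems = []
    for path, want in expected.items():
        got = workspace.get(path)
        if got is None:
            problems.append(f"missing: {path}")
        elif got != want:
            problems.append(f"mismatch: {path}")
    return problems


def closure_rate(runs: List[List[str]]) -> float:
    """Fraction of runs whose final state had no problems."""
    return sum(1 for problems in runs if not problems) / len(runs)


expected = {"config.yaml": "debug: false\n"}
runs = [
    verify_final_state({"config.yaml": "debug: false\n"}, expected),
    verify_final_state({"config.yaml": "debug: true\n"}, expected),
    verify_final_state({}, expected),
    verify_final_state({"config.yaml": "debug: false\n", "x": "y"}, expected),
]
print(runs)
print(closure_rate(runs))  # prints 0.5
```

This version ignores extra files such as `x`. A stricter checker would also flag unexpected changes, which is where state diffs and unsafe-action rates enter.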

Evaluation Stages

Stage I: Final-Output Evaluation

Representative final-output benchmark scores across MMLU, GSM8K, MATH, HumanEval, and related metrics.

Model	Base Model	MMLU	MMLU-Pro	GSM8K	MATH	MATH-500	HumanEval
GPT-5.4	-	94.0	87.0	98.1	90.2	-	94.1
Claude Opus 4.6	-	92.1	82.5	97.8	91.5	-	92.4
Gemini 3.1 Pro	-	92.6	91.2	94.2	85.3	-	87.6
DeepSeek-V4-Pro-Base	-	90.1	73.5	92.6	64.5	-	76.8
Qwen3.7 Max	-	-	89.6	-	94.6	-	92.4
GLM-5.1	-	89.0	86.0	95.3	83.4	-	88.6
SAGE-32B (Think)	Qwen2.5-32B	90.20	79.27	96.74	-	91.8	-
Warmup K&K	Qwen2.5-14B	-	62.7	-	-	77.4	-
AceMath-72B-Instruct	Qwen2.5-Math-72B-Instruct	-	-	96.4	86.1	-	-
PromptCoT-DS-7B	DeepSeek-R1-Distill-Qwen-7B	-	-	92.6	-	93.0	-
Nemotron-CrossThink-32B	Qwen2.5-32B	83.60	69.43	-	-	84.00	-
Introspective X Training	-	50.9	27.9	59.5	46.5	-	54.9
CoT2-Meta	Claude Sonnet 4.5	-	88.4	98.6	92.8	-	72.8
Guideline Forest	GPT-4o-mini	-	-	93.5	-	69.2	95.4
STOP-ECN	DeepSeek-R1-Distill-Qwen-7B	-	-	91.1	-	86.8	-
FastMCTS+Branch-DPO	FastMCTS-7B	-	-	89.9	75.4	-	-
FastMCTS	Qwen2.5-7B	-	-	88.9	74.0	-	-
Rejection Sampling	Qwen2.5-7B	-	-	87.1	70.0	-	-
SBS	DeepSeek-Math-7B-Base	-	-	84.1	66.3	-	-
MCTS	DeepSeek-Math-7B-Base	-	-	83.2	64.0	-	-
DeepSeekMath-7B-RL	DeepSeekMath-7B	-	-	88.2	51.7	-	-
SimPO	Qwen2.5-Math-7B-Instruct	-	-	88.8	40.0	56.6	-
Self-Explore	DeepSeek-Math-7B-Base	-	-	78.6	37.7	-	-
DeepSeek-Coder-V2-Instruct	-	-	-	94.9	75.7	-	90.2
OMI2 (Full)	Qwen2.5-Coder-7B	-	-	88.5	73.2	-	-
CODEI/O	Qwen2.5-Coder-7B	-	-	86.4	71.9	-	-
PyEdu	Qwen2.5-Coder-7B	-	-	85.8	71.4	-	-
MathCoder-CL	Code-Llama-7B	-	-	67.8	30.2	-	-

Stage II: Process-Level Reasoning Evaluation

Process-oriented evaluation across Hard2Verify, DeltaBench, ProcessBench, and PRMBench.

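Model / Method	Base Model	Hard2Verify Step A	Hard2Verify Step F1	Hard2Verify Resp. A	Hard2Verify Resp. F1	Hard2Verify ErrID A	Hard2Verify ErrID F1	DeltaBench Avg.	DeltaBench HM	DeltaBench Corr.	DeltaBench Err.	ProcessBench GSM8K	ProcessBench MATH	ProcessBench Olympiad	ProcessBench Omni	PRMBench Overall	PRMBench Simp.	PRMBench Sound.	PRMBench Sens.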
GPT-5	-	86.53	85.83	89.69	89.52	70.61	69.72	-	-	-	-	-	-	-	-	-	-	-	-
Gemini 2.5 Pro	-	83.37	83.09	85.73	85.46	52.46	52.46	-	-	-	-	-	-	-	-	-	-	-	-
Claude Sonnet 4	-	70.61	60.37	78.24	73.44	53.45	39.30	-	-	-	-	-	-	-	-	-	-	-	-
DeepSeek-R1	-	68.92	62.30	73.95	72.75	54.23	45.35	-	-	-	-	-	-	-	-	67.8	62.9	71.4	77.1
Qwen3-235B-A22B	-	72.55	64.03	79.42	77.87	60.90	50.78	-	-	-	-	-	-	-	-	-	-	-	-
Qwen3-Next-80B-A3B	-	67.91	54.69	75.08	68.31	58.29	43.05	-	-	-	-	-	-	-	-	-	-	-	-
GPT-4o	-	-	-	-	-	-	-	49.9	48.7	42.0	57.9	79.2	63.6	51.4	53.5	66.8	59.7	70.9	75.8
o1-mini	-	-	-	-	-	-	-	-	-	-	-	93.2	88.9	87.2	82.4	68.8	64.6	72.1	75.5
Gemini-2.0-thinking-exp-1219	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	68.8	66.2	71.8	75.3
QwQ-32B-Preview	-	-	-	-	-	-	-	-	-	-	-	88.0	78.7	57.8	61.3	63.6	56.4	68.2	73.5
Llama-3.3-70B-Instruct	-	54.28	18.37	57.04	28.16	49.44	2.50	-	-	-	-	82.9	59.4	46.7	43.0	-	-	-	-
Qwen2.5-72B-Instruct	-	56.01	26.36	61.06	46.89	26.49	16.38	-	-	-	-	76.2	61.8	54.6	52.2	-	-	-	-
Qwen2.5-14B-Instruct	-	60.45	47.59	63.40	63.23	43.47	18.86	-	-	-	-	69.3	53.3	45.0	41.3	-	-	-	-
Qwen2.5-Math-72B-Instruct	-	-	-	-	-	-	-	-	-	-	-	65.8	52.1	32.5	31.7	57.4	55.1	61.1	67.1
Qwen2.5-Math-PRM-72B	Qwen2.5-Math-72B	55.82	35.50	66.80	64.91	41.80	37.28	-	-	-	-	87.3	80.6	74.3	71.1	68.2	54.6	73.9	77.0
Qwen2.5-Math-PRM-7B	Qwen2.5-Math-7B	57.56	42.37	63.08	57.57	35.03	32.50	-	-	-	-	82.4	77.6	67.5	66.3	65.5	52.1	71.0	75.5
UniversalPRM-7B	Qwen2.5-Math-7B-Instruct	64.17	60.27	54.74	41.46	26.08	25.97	-	-	-	-	85.8	77.7	67.6	66.4	-	-	-	-
ActPRM-X	Qwen2.5-Math-PRM-7B	-	-	-	-	-	-	-	-	-	-	82.7	82.0	72.0	67.3	66.7	54.5	72.7	75.6
ActPRM	Qwen2.5-Math-7B	-	-	-	-	-	-	-	-	-	-	81.6	79.8	71.4	67.0	65.5	53.6	71.3	75.2
RefCritic-R1-14B	DeepSeek-R1-Distill-Qwen-14B	-	-	-	-	-	-	-	-	-	-	86.3	82.0	67.6	72.3	-	-	-	-
RefCritic-Qwen-14B	Qwen2.5-14B-Instruct	-	-	-	-	-	-	-	-	-	-	81.9	71.2	58.1	60.7	-	-	-	-
FlexiVe (Think@64)	DeepSeek-R1-Distill-Qwen-14B	-	-	-	-	-	-	-	-	-	-	88.1	90.1	86.7	80.4	-	-	-	-
FlexiVe (Flex@128)	DeepSeek-R1-Distill-Qwen-14B	-	-	-	-	-	-	-	-	-	-	83.0	85.0	80.0	75.2	-	-	-	-
GenPRM-32B (Maj@8)	Qwen2.5-32B	-	-	-	-	-	-	-	-	-	-	85.1	86.3	78.9	80.1	-	-	-	-
GenPRM-7B (Maj@8)	Qwen2.5-7B	-	-	-	-	-	-	-	-	-	-	81.0	85.7	78.4	76.8	-	-	-	-
SPC (Round 2)	Qwen2.5-7B-Instruct	-	-	-	-	-	-	60.5	59.5	68.2	52.8	-	-	-	-	-	-	-	-
SPC (Round 1)	Qwen2.5-7B-Instruct	-	-	-	-	-	-	58.8	57.3	68.4	49.3	-	-	-	-	-	-	-	-
SPC (Round 0)	Qwen2.5-7B-Instruct	-	-	-	-	-	-	54.9	53.5	45.9	64.0	-	-	-	-	-	-	-	-
Qwen2.5-Math-7B-PRM800K	Qwen2.5-Math-7B	-	-	-	-	-	-	58.5	41.3	90.1	26.8	68.2	62.6	50.7	44.3	-	-	-	-
Pure-PRM-7B	Qwen2.5-Math-7B	-	-	-	-	-	-	-	-	-	-	69.0	66.5	48.4	45.9	65.3	52.2	70.2	75.8
Skywork-PRM-7B	Qwen2.5-Math-7B	38.52	34.12	56.77	29.81	11.56	8.36	-	-	-	-	70.8	53.6	22.9	21.0	65.1	59.6	68.5	73.3
Math-Shepherd-PRM-7B	Mistral-7B	-	-	-	-	-	-	53.3	14.3	7.69	98.8	47.9	29.5	24.8	23.8	47.0	47.1	45.7	60.7
RLHFlow-PRM-Mistral-8B	Llama-3.1-8B	-	-	-	-	-	-	-	-	-	-	50.4	33.4	13.8	15.8	54.4	46.7	57.5	68.5
ReasonEval-34B	CodeLlama-34B	-	-	-	-	-	-	-	-	-	-	-	-	-	-	60.5	51.5	63.0	73.1
ReasonFlux-PRM-7B	DeepSeek-R1-Distill-Qwen-7B	53.09	22.40	55.89	53.82	42.48	28.71	-	-	-	-	-	-	-	-	-	-	-	-
uPRM	Qwen2.5-Math-7B	-	-	-	-	-	-	-	-	-	-	58.3	52.6	42.7	39.8	-	-	-	-

Notes. Hard2Verify reports balanced accuracy and F1 for step, response, and error-identification tasks; DeltaBench reports average, HM, correct-step recall, and error-step recall; ProcessBench and PRMBench report their original F1/PRMScore metrics.
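The DeltaBench Avg. and HM columns appear to be the arithmetic and harmonic means of the correct-step and error-step recalls. That reading is inferred from the numbers rather than stated in the notes, so the sketch below checks it against three rows copied from the table, allowing for rounding to one decimal place.

```python
def arithmetic_mean(a: float, b: float) -> float:
    return (a + b) / 2


def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b)


# (correct-step recall, error-step recall, reported Avg., reported HM),
# copied from three DeltaBench rows of the table above.
ROWS = {
    "GPT-4o": (42.0, 57.9, 49.9, 48.7),
    "SPC (Round 2)": (68.2, 52.8, 60.5, 59.5),
    "Math-Shepherd-PRM-7B": (7.69, 98.8, 53.3, 14.3),
}


def matches(reported: float, computed: float, tol: float = 0.1) -> bool:
    """Allow for rounding in the published one-decimal figures."""
    return abs(reported - computed) <= tol


for name, (corr, err, avg, hm) in ROWS.items():
    print(name, matches(avg, arithmetic_mean(corr, err)), matches(hm, harmonic_mean(corr, err)))
```

All three rows agree within rounding. The Math-Shepherd row shows why the harmonic mean matters: an error-step recall of 98.8 paired with a correct-step recall of 7.69 still averages above 53, while the harmonic mean drops to about 14 and exposes the imbalance.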

Stage III: Task-Closure Evaluation

Interactive task success and pass-rate evaluation across SWE, terminal, OS, web, and browsing benchmarks.

Model / Method	Base Model	SWE-V	Terminal 2.0	OSWorld-V	WebArena-V	BrowseComp	MCP-Atlas
GPT-5.4 xHigh	-	-	75.1	75.0	67.3	82.7	67.2
Claude Opus 4.6 Max	-	80.8	65.4	-	-	83.7	73.8
Gemini 3.1 Pro High	-	80.6	68.5	-	-	85.9	69.2
DeepSeek-V4-Pro Max	-	80.6	67.9	-	-	83.4	73.6
Kimi-K2.6 Thinking	-	80.2	66.7	-	-	83.2	66.6
GLM-5.1 Thinking	-	-	63.5	-	-	79.3	71.8
UI-TARS-2	UI-TARS-2	68.7	45.3*	-	-	29.6*	-
OpenCUA-72B	Qwen2.5-VL-72B	-	-	45.0	-	-	-
SWE-Exp	Claude 4 Sonnet	73.0	-	-	-	-	-
Kimi-Dev	Qwen2.5-72B-Base	60.4	-	-	-	-	-
SWE-Master-32B-RL	Qwen2.5-Coder-32B	61.4	-	-	-	-	-
PDR+RTV	Gemini 3.1 Pro	76.60	64.77	-	-	-	-
TACT-GATE	Qwen3.5-27B	73.3	36.0	-	-	-	-
IHR+NLAH	GPT-5.4-mini	73.0	53.9	-	-	-	-
Polar RL (Pi)	Qwen3.5-4B	40.4	-	-	-	-	-
SA-SWE-32B	Qwen3-32B	39.4	16.25	-	-	19.4 (+)	-
CODESKILL	Qwen3.5-35B-A3B	66.0	34.12	-	-	-	-
CodeScout-14B	Qwen3-Coder-30B-A3B	46.00	-	-	-	-	-
TACO	MiniMax-M2.5	-	44.16	-	-	-	-
ComputerRL	GLM-4.1V-9B-Thinking	-	-	48.0	-	-	-
UltraCUA-32B-RL	UltraCUA-32B	-	-	43.7	-	-	-
OS-Symphony	GPT-5	-	-	65.8	-	-	-

Notes. SWE-V denotes SWE-bench Verified; OSWorld-V denotes OSWorld-Verified; WebArena-V denotes WebArena-Verified; Terminal 2.0 denotes Terminal-Bench v2.0.

Stage IV: Workspace and OpenClaw Evaluation

Workspace-level evaluation, OpenClaw metrics, and trajectory-level safety where lower attack success rate is better.

Model / Method	Setting	Claw-Eval Gen.	Claw-Eval Multi	ClawBench SR	ClawsBench TSR	ClawsBench UAR	ClawsBench SCR	ATBench-Claw Acc.	ATBench-Claw F1	ATBench-Claw Rec.	ClawSafety ASR ↓
Claude Opus 4.6	OpenClaw on/on	70.8	68.4	-	63	23	50	-	-	-	-
Claude Sonnet 4.6	OpenClaw / ClawSafety	68.3	65.8	33.3	56	13	48	-	-	-	40.0
MiMo-V2.5-Pro	Claw-Eval	64.0	63.2	-	-	-	-	-	-	-	-
GLM-5.1	Claw-Eval	62.7	60.5	-	-	-	-	-	-	-	-
Muse Spark	Claw-Eval	62.7	68.4	-	-	-	-	-	-	-	-
Kimi K2.6	Claw-Eval	61.5	65.8	-	-	-	-	-	-	-	-
GPT-5.4	OpenClaw on/on	60.2	60.5	6.5	53	7	41	-	-	-	-
DeepSeek V4 Pro	Claw-Eval	58.4	65.8	-	-	-	-	-	-	-	-
Qwen3.6 Plus	Claw-Eval	57.1	65.8	-	-	-	-	-	-	-	-
Qwen3.5-397B-A17B	AgentDoG prompt	57.8	52.6	-	-	-	-	83.8	86.5	87.5	-
GLM-5	OpenClaw text-only / on/on	-	-	24.2	60	23	48	-	-	-	-
Gemini 3 Flash	ClawBench	-	-	19.0	-	-	-	-	-	-	-
Claude Haiku 4.5	ClawBench	-	-	18.3	-	-	-	-	-	-	-
Gemini 3.1 Flash-Lite	OpenClaw on/on	-	-	3.3	39	23	26	-	-	-	-
Kimi K2.5	OpenClaw / ClawSafety	52.8	50.0	0.7	-	-	-	-	-	-	60.8
Gemini 3.1 Pro	OpenClaw on/on	55.9	65.8	-	58	10	48	-	-	-	-
Qwen3Guard-Gen-8B	Guard model	-	-	-	-	-	-	52.1	36.3	23.1	-
Llama-Guard-4-12B	Guard model	-	-	-	-	-	-	74.4	73.4	60.0	-
ShieldAgent	Guard model	-	-	-	-	-	-	68.1	60.1	43.3	-
Llama-3.3-70B-Instruct	AgentDoG prompt	-	-	-	-	-	-	80.6	82.3	76.4	-
AgentDoG-Qwen3-4B	AgentDoG	-	-	-	-	-	-	87.2	89.6	92.9	-
Gemini 2.5 Pro	OpenClaw	-	-	-	-	-	-	-	-	-	55.0
DeepSeek V3	OpenClaw	-	-	-	-	-	-	-	-	-	67.5
GPT-5.1	OpenClaw	-	-	-	-	-	-	-	-	-	75.0
Claude Sonnet 4.6 + Nanobot	Nanobot scaffold	-	-	-	-	-	-	-	-	-	48.6
Claude Sonnet 4.6 + NemoClaw	NemoClaw scaffold	-	-	-	-	-	-	-	-	-	45.8

Notes. Claw-Eval reports general and multi-turn PassAll³; ClawsBench reports task success, unsafe action, and safe completion rates; ClawSafety reports attack success rate, where lower is better.

Future

Challenges and Future Directions

Reliable autonomy introduces failures that are longer-horizon, stateful, harder to reverse, and tied to real workspace changes. Future systems need stronger closure, safer permissions, robust memory, and governance.

Figure: Open challenges for reliable autonomy: task closure, safety and governance, memory, context management, and persistent workspace state.

Figure: Future directions toward self-evolving AI ecosystems that accumulate experience and improve their operating environments.

Open Challenges: Making Autonomy Reliable

Current agents can look impressive in isolated demonstrations, but dependable digital work requires stable performance across long horizons, safe operation under permission constraints, and persistent memory that does not collapse as context grows. The challenge is not only stronger capability, but autonomous behavior that is auditable, recoverable, and controllable.

Long-Horizon Task Closure

Agents must maintain progress until the intended workspace state is achieved. Tool errors, partial failures, and early planning mistakes can compound across many actions, so future systems need progress monitoring, intermediate verification, self-healing, and rollback to safe checkpoints.
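One way to picture intermediate verification and rollback is a loop that snapshots state before each step, checks a postcondition afterward, and restores the snapshot when the check fails. The following is a minimal sketch under stated assumptions, not a description of any particular system: the function `run_with_checkpoints`, the in-memory dictionary standing in for workspace state, and the step names are all hypothetical.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]  # hypothetical stand-in for real workspace state
Step = Tuple[str, Callable[[State], None], Callable[[State], bool]]


def run_with_checkpoints(state: State, steps: List[Step], max_retries: int = 1) -> List[str]:
    """Run steps in order; verify each one and roll back to a checkpoint on failure."""
    log: List[str] = []
    for name, action, verify in steps:
        checkpoint = copy.deepcopy(state)  # safe point taken before the step
        for attempt in range(max_retries + 1):
            try:
                action(state)
                ok = verify(state)  # intermediate verification
            except Exception:
                ok = False  # a tool error counts as a failed step
            if ok:
                log.append(f"{name}: ok")
                break
            # Restore the last known-good state before retrying or giving up.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
            log.append(f"{name}: rolled back (attempt {attempt + 1})")
        else:
            # Retries exhausted: stop so the error cannot compound downstream.
            log.append(f"{name}: aborted")
            break
    return log


def write_config(s: State) -> None:
    s["config"] = "v2"


def broken_migration(s: State) -> None:
    s["db"] = "half-migrated"  # partial failure: state is touched, then the tool errors
    raise RuntimeError("tool error")


state = {"config": "v1", "db": "v1"}
log = run_with_checkpoints(
    state,
    [
        ("write_config", write_config, lambda s: s["config"] == "v2"),
        ("migrate_db", broken_migration, lambda s: s["db"] == "v2"),
    ],
)
```

In this toy run the first step succeeds and is kept, while the failing migration is rolled back on both attempts, so the half-migrated value never survives in `state`. A real workspace would need snapshots of files, databases, and external side effects, which a deep copy of a dictionary does not capture.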

Safety and Governance

Once agents can touch files, browsers, APIs, terminals, databases, and enterprise applications, safety becomes operational. Fine-grained permissions, risk-aware action validation, audit trails, sandboxing, and calibrated human approval are required for high-impact actions.
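The combination of fine-grained permissions, risk-aware validation, human approval, and audit trails can be sketched as a single gate that every action passes through. This is an illustrative sketch only: the `Gate` and `Action` classes, the three risk levels, and the tool names such as `fs.read` are assumptions made for the example, not an interface defined by the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}  # assumed three-level risk scale


@dataclass
class Action:
    tool: str
    target: str
    risk: str  # one of the keys in RISK_ORDER


@dataclass
class Gate:
    # tool name -> highest risk level that may run without human approval
    auto_approve: Dict[str, str]
    approver: Callable[[Action], bool]  # stand-in for a human approval step
    audit_log: List[dict] = field(default_factory=list)

    def check(self, action: Action) -> bool:
        if action.tool not in self.auto_approve:
            allowed, reason = False, "tool not permitted"
        elif RISK_ORDER[action.risk] <= RISK_ORDER[self.auto_approve[action.tool]]:
            allowed, reason = True, "within auto-approve limit"
        else:
            # High-impact action: escalate instead of deciding automatically.
            allowed = self.approver(action)
            reason = "human approved" if allowed else "human denied"
        # Every decision is recorded, whether or not the action runs.
        self.audit_log.append(
            {"tool": action.tool, "target": action.target,
             "risk": action.risk, "allowed": allowed, "reason": reason}
        )
        return allowed


gate = Gate(
    auto_approve={"fs.read": "high", "fs.write": "low"},
    approver=lambda action: False,  # this demo reviewer denies everything
)
results = [
    gate.check(Action("fs.read", "notes.txt", "low")),
    gate.check(Action("fs.write", "prod.db", "high")),
    gate.check(Action("shell", "cleanup.sh", "high")),
]
```

The design choice worth noting is that denial is the default: a tool absent from the permission table is rejected without reaching the approver, and the audit log captures denied actions as well as allowed ones.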

Memory and Persistent State

Long-running work needs more than a transient context window. Agents need working memory for the current trajectory, episodic memory for past interactions, procedural memory for reusable skills, and external state stores that preserve decision-critical facts while discarding noise.
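The four kinds of memory named above can be sketched as separate stores with different retention rules. The class below is a simplified illustration under assumptions: `AgentMemory`, its method names, and the three-event working window are hypothetical, and real systems would typically use summarization and retrieval rather than plain lists.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class AgentMemory:
    # Working memory: bounded window over the current trajectory.
    working: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Episodic memory: summaries of past interactions.
    episodic: List[str] = field(default_factory=list)
    # Procedural memory: reusable skills stored as named step lists.
    procedural: Dict[str, List[str]] = field(default_factory=dict)
    # External state store: decision-critical facts that outlive the window.
    facts: Dict[str, str] = field(default_factory=dict)

    def observe(self, event: str, fact_key: str = "", fact_value: str = "") -> None:
        self.working.append(event)  # oldest events fall out of the window
        if fact_key:
            self.facts[fact_key] = fact_value  # keep only what is flagged as critical

    def learn_skill(self, name: str, steps: List[str]) -> None:
        self.procedural[name] = list(steps)

    def end_episode(self, summary: str) -> None:
        self.episodic.append(summary)
        self.working.clear()


mem = AgentMemory()
mem.observe("opened repo")
mem.observe("ran tests: 2 failures")
mem.observe("user said deploy target is staging",
            fact_key="deploy_target", fact_value="staging")
mem.observe("patched parser")
recent = list(mem.working)  # the first event has already been evicted
mem.learn_skill("fix_failing_test", ["reproduce", "patch", "rerun tests"])
mem.end_episode("fixed parser bug on staging branch")
```

After the episode ends, the working window is empty, but the deploy target persists in `facts` and the episode survives as a one-line summary, which is the "preserve decision-critical facts while discarding noise" behavior in miniature.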

Future Directions: Toward Self-Evolving AI Ecosystems

The next stage is likely to be defined not only by larger foundation models, but by the ecosystems around them. Capability becomes distributed across prompts, context pipelines, tools, memories, workspaces, evaluators, skills, policies, and governance layers. In this view, operational experience becomes the material from which systems improve.

Prompt, Context, and Harness Engineering

Prompts specify intent, context supplies task-relevant state, and harnesses define the operational world where actions have consequences. Future progress depends on execution substrates that make actions inspectable, constrained, replayable, and learnable.

AI-Native Workspaces

A workspace gives the model a digital body: files, terminals, browsers, repositories, logs, snapshots, permissions, provenance, replay, rollback, and final-state diffs. These primitives make both safety and learning possible.
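Of the primitives listed, the final-state diff is perhaps the easiest to make concrete: compare a snapshot taken before the task with one taken after, and judge the task on what actually changed. The sketch below assumes a snapshot is simply a mapping from path to content; the function name `final_state_diff` and the file names are hypothetical.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # assumed representation: path -> file content


def final_state_diff(before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """Summarize what changed between two workspace snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(
            path for path in before.keys() & after.keys()
            if before[path] != after[path]
        ),
    }


before = {"README.md": "v1", "src/app.py": "print(1)", "tmp/cache": "x"}
after = {"README.md": "v1", "src/app.py": "print(2)", "CHANGELOG.md": "fixed"}

diff = final_state_diff(before, after)

# A verifier can compare the observed diff with the change the task intended.
expected = {"added": ["CHANGELOG.md"], "removed": ["tmp/cache"], "modified": ["src/app.py"]}
task_closed = diff == expected
```

Checking the diff rather than the agent's own report is what makes the evidence auditable: the same comparison can be replayed later from stored snapshots.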

Beyond-Gradient Learning

Improvement can happen outside model weights. A failed tool call can become a better wrapper, a repeated correction can become memory, a successful trajectory can become a skill, and a benchmark failure can become a regression test.

Composable Skills and Multi-Agent Work

Skills need interfaces, versions, dependency contracts, tests, documentation, security review, and deprecation. Multi-agent systems also need shared state, ownership boundaries, escalation protocols, and final-state verifiers.
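To illustrate what versions, dependency contracts, tests, and deprecation might look like for a skill, the sketch below registers a skill only if its dependencies are present at a minimum version and its own tests pass. The `Skill` and `SkillRegistry` classes and the tuple-based version scheme are assumptions for the example, not a format specified by the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Version = Tuple[int, int, int]  # assumed (major, minor, patch) scheme


@dataclass
class Skill:
    name: str
    version: Version
    run: Callable[[str], str]  # the skill's interface in this toy example
    requires: Dict[str, Version] = field(default_factory=dict)  # name -> minimum version
    tests: List[Tuple[str, str]] = field(default_factory=list)  # (input, expected output)
    deprecated: bool = False


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        if skill.deprecated:
            raise ValueError(f"{skill.name} is deprecated")
        # Dependency contract: each required skill must exist at a sufficient version.
        for dep, minimum in skill.requires.items():
            installed = self._skills.get(dep)
            if installed is None or installed.version < minimum:
                raise ValueError(f"{skill.name}: unmet dependency {dep} >= {minimum}")
        # A skill ships with tests and is admitted only if they pass.
        for given, expected in skill.tests:
            if skill.run(given) != expected:
                raise ValueError(f"{skill.name}: test failed for input {given!r}")
        self._skills[skill.name] = skill

    def names(self) -> List[str]:
        return sorted(self._skills)


registry = SkillRegistry()
registry.register(Skill("normalize", (1, 2, 0),
                        run=lambda s: s.strip().lower(),
                        tests=[("  Hello ", "hello")]))
registry.register(Skill("slugify", (0, 1, 0),
                        run=lambda s: s.strip().lower().replace(" ", "-"),
                        requires={"normalize": (1, 0, 0)},
                        tests=[("Hello World", "hello-world")]))
try:
    registry.register(Skill("broken", (0, 1, 0), run=lambda s: s, tests=[("a", "b")]))
    rejected = False
except ValueError:
    rejected = True
```

Documentation and security review are omitted here because they are human processes; in practice they would likely appear as required metadata fields or sign-off records on the same manifest.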

Toward Self-Evolving AI Ecosystems

A governed self-evolving loop turns operational traces into durable assets: successful trajectories become reusable skills, failures become regression tests, user corrections become memories, tool errors become wrapper updates, and safety incidents become policies. Every consequential action should become evidence, and every durable update should be validated, versioned, auditable, reversible, and deployed under explicit permission boundaries.
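The mapping from operational traces to durable assets, together with the validation, versioning, reversibility, and permission requirements, can be sketched as a small routing table and an append-only store. This is a hedged illustration: the event kind strings, the `AssetStore` class, and the `non_empty` validator are hypothetical, and the routing table simply restates the pairs listed in the paragraph above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Trace kind -> durable asset type, following the pairs described in the text.
ROUTES = {
    "success": "skill",
    "failure": "regression_test",
    "user_correction": "memory",
    "tool_error": "wrapper_update",
    "safety_incident": "policy",
}


@dataclass(frozen=True)
class Update:
    asset: str
    payload: str
    version: int


class AssetStore:
    def __init__(self, permitted_assets: Iterable[str]) -> None:
        self.permitted = set(permitted_assets)  # explicit permission boundary
        self.history: List[Update] = []  # auditable, ordered record of updates

    def propose(self, kind: str, payload: str,
                validate: Callable[[str], bool]) -> Optional[Update]:
        asset = ROUTES.get(kind)
        if asset is None or asset not in self.permitted:
            return None  # outside this deployment's permissions
        if not validate(payload):
            return None  # durable updates must pass validation first
        update = Update(asset, payload, version=len(self.history) + 1)
        self.history.append(update)
        return update

    def revert(self) -> Optional[Update]:
        """Undo the most recent update, if any."""
        return self.history.pop() if self.history else None


def non_empty(payload: str) -> bool:
    return bool(payload.strip())


store = AssetStore(permitted_assets={"skill", "regression_test", "memory"})
a = store.propose("success", "steps: clone, patch, test", non_empty)
b = store.propose("safety_incident", "require approval for deletes", non_empty)
c = store.propose("failure", "   ", non_empty)
d = store.propose("failure", "test: parser handles empty input", non_empty)
undone = store.revert()
```

Here the policy update is refused because this store was not granted permission to change policies, the blank payload fails validation, and the last accepted update can be reverted, leaving only the first skill in the history.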

Resources

Access the paper, code, citation, project page, and contact channel for Next-Generation AI Systems.

Paper: arXiv Link

Code: Website

Project: GitHub Pages

Contact: zyhbrz@gmail.com

Citation

Use the BibTeX below to cite this work.

@misc{zhang2026chatbotdigitalcolleagueparadigm,
      title={From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI}, 
      author={Yongheng Zhang and Ziang Liu and Jiaxuan Zhu and Shuai Wang and Xiangqi Chen and Haojing Huang and Jiayi Kuang and Siyu Chen and Ao Shen and Hao Wu and Qiufeng Wang and Qian-Wen Zhang and Junnan Dong and Wenhao Jiang and Ying Shen and Hai-Tao Zheng and Yinghui Li and Di Yin and Xing Sun and Philip S. Yu},
      year={2026},
      eprint={2606.14502},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.14502}, 
}