When intelligent language model-based systems first proliferated, the environments these models operated within were limited, often taking the form of chat-based assistants (e.g., ChatGPT) that relay text back and forth through an open-ended conversational experience. While great for end users beginning to form an understanding of LLMs' intelligence, this limited the model's action space to just text generation. It was quickly discovered that these text generation and reasoning abilities could be extended to take actions via function/tool calling, an additional fine-tuning technique where function schemas (name, purpose, and parameters) are provided alongside regular text inputs. This expanded the action space by allowing LLMs to "use" a tool by identifying and populating the parameters of the chosen tool, which is then executed in a separate environment with the results returned to the conversation context. While this does expand the scope of an LLM's ability to interact with (and change!) the world it lives in, the model still operated within a limited chat-based interface or a routinely scheduled invocation.
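To make the function-calling idea concrete, here is a minimal sketch of what a tool schema and a tool-call response might look like. The JSON-style layout loosely mirrors what popular chat APIs accept, but the field names and the `get_weather` tool itself are illustrative assumptions, not any particular provider's spec.

```python
# A minimal sketch of a function/tool schema exposed to an LLM.
# Field names and the get_weather tool are illustrative assumptions,
# not a specific provider's API.
weather_tool = {
    "name": "get_weather",                                  # how the model refers to the tool
    "description": "Look up current weather for a city.",   # the tool's purpose
    "parameters": {                                         # JSON-schema style parameters
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

# The model never runs the tool itself; it only emits a structured call like this,
# which the surrounding system executes before appending the result back into
# the conversation context.
example_tool_call = {"name": "get_weather", "arguments": {"city": "Paris", "units": "metric"}}
```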
However, as we consider the nature of large language models and their interaction with the world, the parallels between the way humans and LLMs operate become more apparent. Early research such as ReAct exemplified an LLM's ability to mimic human-like execution by reasoning, acting, and reflecting on the task at hand, and that pattern has persisted. While this high-level idea has continued to be implemented in various agentic approaches, the core philosophy can be generalized to the LLM's operating environment. This has led to expanding the environments of LLMs to more continuous digital scenarios, namely browsers and desktops, that take further advantage of LLMs' human-like behavior by directly integrating them into common human-facing digital worlds. This expansion builds toward long-running, multi-step task execution for language models while dramatically increasing the complexity of their action space. The overarching idea is that if LLMs are meant to mimic human behavior and intelligence, then we should integrate them with environments and tools so they can act like humans, rather than forcing them into traditional programmatic approaches.
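For reference, the ReAct pattern boils down to a short loop of thought, action, and observation. The sketch below is a heavily simplified illustration built on assumed helpers (`llm` for model calls, `run_tool` for executing an action); it is not the paper's implementation.

```python
# A simplified ReAct-style loop: reason, act, observe, repeat.
# `llm` and `run_tool` are assumed helpers, not a real library API.

def react_loop(task, llm, run_tool, max_steps=10):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Ask the model to reason about the next step and propose an action.
        step = llm(transcript + "Thought + Action:")
        transcript += step + "\n"
        if "FINISH" in step:          # model signals the task is complete
            break
        observation = run_tool(step)  # execute the proposed action in the environment
        transcript += f"Observation: {observation}\n"  # the model reflects on this next turn
    return transcript
```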
Action Space
So what do we mean by the 'action space of an LLM'? This requires a light refresher of reinforcement learning. In classic RL there exists a primary decision maker called the agent that interacts with and learns from the environment it exists in. Note: this is not to be confused with the currently popular, yet slightly different, definition of an agent that broadly refers to AI systems that perform tasks autonomously. The environment, then, defines the rules, dynamics, and constraints of the problem the agent is trying to solve; in other words, it defines the agent's world. Notably, this includes a state space and an action space, with the former encompassing all possible situations the environment can be in and the latter all possible actions the agent can take. As the agent performs an action in the environment, it receives a signal in the form of a state observation and a reward. The primary objective of the agent is to maximize the reward it receives in the world it lives in by taking actions, observing the outcomes, and learning through trial and error what works best, i.e., what earns the biggest reward.
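In code, that interaction loop is very small. The sketch below defines a toy environment and a random-action agent just to show the state / action / reward cycle; the number-guessing environment is an invented example, not a standard benchmark.

```python
import random

# Toy environment: the agent must find a hidden number between 0 and 9.
class GuessEnv:
    def __init__(self):
        self.target = random.randint(0, 9)   # hidden state of the world
        self.action_space = list(range(10))  # all possible actions

    def step(self, action):
        reward = 1.0 if action == self.target else 0.0
        observation = "correct" if reward else ("too low" if action < self.target else "too high")
        done = reward == 1.0
        return observation, reward, done

# Classic RL loop: act, observe the new state, collect the reward, repeat.
env = GuessEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice(env.action_space)      # a real agent would learn a policy here
    observation, reward, done = env.step(action)  # environment returns observation + reward
    total_reward += reward
print(total_reward)
```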
Extrapolating from this, we can more specifically define the action space of an LLM. In the original and most basic text-completion environment, the only action was to generate text given an input, but in the environment of platforms like ChatGPT the LLM can either generate text or use one of multiple tools: web search, image generation, code execution, canvas creation, kicking off specific workflows like deep research or agent mode, etc. As mentioned, these tools are defined by the experience creators and exposed to the LLM, which can dynamically choose which tool to use and how. Similar to an RL setup, the chosen tool is executed and its result returned to the LLM so it may observe the outcome (current state) and determine a follow-up (action). We then rely on the general knowledge and reasoning capabilities of the language model to use all actions available to it at the time, along with their results, to work towards the goal set by the user. Feedback generally comes both from a user's positive or negative response and from an intrinsic understanding of whether progress has been made towards the request. Contrary to defined RL environments, language models take a 'generalized approach' to operating in their world. This is enabled by their quality as few-shot learners, meaning they can learn traits or behaviors by observing a few examples without any underlying policy changes via continued training. This allows foundation language models to be inserted into many different environments, given a set of possible actions as tools, and set to work towards completing a desired goal.
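To connect the two framings, here is a rough sketch of an LLM acting in a tool environment: the set of tools is the action space, the tool result is the observation, and the loop runs until the model decides the goal is reached. `call_model` and the tool functions are placeholders I'm assuming for illustration, not a specific provider's SDK.

```python
import json

# Hypothetical tool implementations; the action space is simply this dict's keys.
TOOLS = {
    "web_search": lambda query: f"Top results for {query!r} ...",
    "run_code": lambda code: "stdout: 42",
}

def call_model(messages):
    """Placeholder for a chat-completion call that returns either plain text
    or a JSON tool call like {"tool": ..., "input": ...}."""
    raise NotImplementedError

def agent_loop(user_goal, max_steps=8):
    messages = [{"role": "user", "content": user_goal}]
    for _ in range(max_steps):
        reply = call_model(messages)                 # the model chooses its next action
        try:
            action = json.loads(reply)               # structured tool call -> take an action
        except (json.JSONDecodeError, TypeError):
            return reply                             # plain text -> final answer to the user
        result = TOOLS[action["tool"]](action["input"])        # execute in the environment
        messages.append({"role": "tool", "content": result})   # observation fed back in
    return "Step limit reached."
```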
Ultimately, we are left with a generally intelligent agent whose action space and environment can be dynamically redefined without requiring any retraining.
Human-Digital Environments
When considering these traits, we can draw connections to cognitive development and human learning. As we are introduced to and grow within the world, we interact with our environment and over time develop an understanding of how things work. We use the tools available to us through verbal, mental, and physical actions, observe what happens, and then learn from those experiences. Large language models have inherited many of these traits as an emergent result of scaled pretraining, further reinforced through task-specific fine-tuning. With these similarities in mind, we should challenge traditional software approaches to LLM systems in digital environments, specifically around how LLMs integrate with existing software. This idea has already been established with the term 'copilot' being coined a couple of years ago, yet we are still lacking reliable integration into some of the most common digital environments. While the problem of building a reliable 'AI' system within a piece of software is certainly not a trivial one to solve, it may be that software development is not tackling it with a 'human' approach.
As an illustrative example, consider the argument for humanoid robotics. Despite not being the most efficient design in itself, a humanoid robot can fit much more easily into the current human-centric world. Most environments are designed for the human form, so a humanoid robot can operate better within them if given the same actions available to a human. This dramatically simplifies training, assessment, and data-gathering processes, as we can draw on everyday observations that have been proven over time to work. The same can be proposed not just for physical environments, but for digital ones as well. When we use a computer, we don't (for the most part) interact directly with the machine's code; rather, we use graphical interfaces and a limited set of actions to complete a given task. Accordingly, if LLMs parallel human behavior and cognition, then they should be able to operate within the same digital environments in a similar manner.
Computer Use
This brings us to the idea of computer use which, put simply, involves giving an LLM a computer environment and all of the expected navigation actions. The language model then takes an input task and goes forth into the digital world to try to complete it, simulating very closely how a person approaching the same task might behave. This bypasses any need for complex integrations or APIs by quite literally presenting the same front-end digital experience directly to the LLM and having the model interact within the computer environment. This can, of course, be as open-ended or as limited as desired, with some approaches giving full containerized desktops preloaded with a variety of applications, others providing just a web browser, or, most simply, a single application interface. The key difference I've noticed, however, is that the information tends to be observed visually by the model. Actions and their results are often captured as screenshots and passed as images for the LLM to observe and plan its next move, further increasing the similarities between human and LLM behavior. To increase the success of completing multi-step, long-horizon tasks, an emphasis on planning, reflection, and memory is included, with each action triggering an observation of what has happened and how it has contributed towards completing the task, contextualized by prior action+reflection combos. This approach and environment further enable language models to behave and operate in human-made digital environments, which may better align with their human-based understanding of the world.
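As a rough sketch of what such a loop can look like, the snippet below captures a screenshot, asks a vision-capable model for the next action, and executes it with `pyautogui`. The `ask_model` helper and the action format are assumptions for illustration; real computer-use stacks layer planning, reflection, and memory on top of this core cycle.

```python
import pyautogui  # real library for screenshots and mouse/keyboard control

def ask_model(task, screenshot, history):
    """Placeholder for a vision-capable LLM call that returns the next action,
    e.g. {"type": "click", "x": 120, "y": 340}, {"type": "type", "text": "hello"},
    or {"type": "done"}. This action format is an assumption, not a standard."""
    raise NotImplementedError

def computer_use_loop(task, max_steps=20):
    history = []  # prior actions serve as a simple memory for reflection
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()            # observe the screen visually
        action = ask_model(task, screenshot, history)  # plan the next move from the image
        if action["type"] == "done":
            break
        elif action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # act on the GUI like a human would
        elif action["type"] == "type":
            pyautogui.write(action["text"])
        history.append(action)                         # record what has happened so far
```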