Foundation model


In artificial intelligence, a foundation model, also known as a large X model (LxM), is a machine learning or deep learning model trained on vast datasets so that it can be applied across a wide range of use cases. Generative AI applications such as large language models are common examples of foundation models.
Building foundation models is often highly resource-intensive, with the most advanced models costing hundreds of millions of dollars to cover the expenses of acquiring, curating, and processing massive datasets, as well as the compute power required for training. These costs stem from the need for sophisticated infrastructure, extended training times, and advanced hardware, such as GPUs. In contrast, adapting an existing foundation model for a specific task or using it directly is far less costly, as it leverages pre-trained capabilities and typically requires only fine-tuning on smaller, task-specific datasets.
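As a hedged illustration of why adaptation is so much cheaper than pretraining, the minimal PyTorch sketch below freezes a stand-in pretrained backbone and trains only a small task-specific head; the architecture, sizes, and names are illustrative, not taken from any particular system.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained foundation-model backbone; in practice this
# would be loaded from a checkpoint rather than randomly initialized.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False  # reuse the pretrained weights as-is

head = nn.Linear(256, 2)  # new, small task-specific classifier
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(token_embeddings, labels):
    """One gradient step on a small labeled batch; only the head learns."""
    with torch.no_grad():  # no gradients through the frozen backbone
        features = backbone(token_embeddings).mean(dim=1)  # pool over tokens
    loss = loss_fn(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 8 "sentences" of 16 token embeddings each.
print(fine_tune_step(torch.randn(8, 16, 256), torch.randint(0, 2, (8,))))
```

Only the head's few hundred parameters receive gradients here, which is one reason task adaptation needs far less data and compute than training the backbone from scratch.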
Early examples of foundation models are language models like OpenAI's GPT series and Google's BERT. Beyond text, foundation models have been developed across a range of modalities, including DALL-E, Stable Diffusion, and Flamingo for images, MusicGen and LLark for music, and RT-2 for robotic control. Foundation models are also being developed for fields like astronomy, radiology, genomics, coding, time-series forecasting, mathematics, and chemistry.

Definitions

The Stanford Institute for Human-Centered Artificial Intelligence's Center for Research on Foundation Models coined the term "foundation model" in August 2021 to mean "any model that is trained on broad data that can be adapted to a wide range of downstream tasks". This was based on their observation that preexisting terms, while overlapping, were not adequate: "'large language model' was too narrow given the focus is not only language; 'self-supervised model' was too specific to the training objective; and 'pretrained model' suggested that the noteworthy action all happened after 'pretraining'". The term "foundation model" was chosen over "foundational model" because "foundational" implies that these models provide fundamental principles in a way that "foundation" does not. The term vision-language model is also used as a near-synonym.
As governments regulate foundation models, new legal definitions have emerged.
  • In the United States, the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence defines a foundation model as "an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts".
  • In the United States, the proposed AI Foundation Model Transparency Act of 2023 by House Representatives Don Beyer and Anna Eshoo defines a foundation model as "an artificial intelligence model trained on broad data, generally uses self supervision, generally contains at least 1,000,000,000 parameters, is applicable across a wide range of contexts, and exhibits, or could be easily modified to exhibit, high levels of performance at tasks that could pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters."
  • In the European Union, the European Parliament's negotiated position on the E.U. AI Act defines a foundation model as an "AI model that is trained on broad data at scale, is designed for generality of output, and can be adapted to a wide range of distinctive tasks".
  • In the United Kingdom, the Competition and Markets Authority's AI Foundation Models: Initial Report defines foundation models as "a type of AI technology that are trained on vast amounts of data that can be adapted to a wide range of tasks and operations."
The United States' definitions are the only ones to reference the size of a foundation model, and they differ on magnitude. Beyer and Eshoo's definition also specifies that foundation models must achieve a level of performance high enough to pose a potential danger. In contrast, the E.U. definition requires the model to be designed for generality of output. All definitions agree that foundation models must be trained on a broad range of data with potential applications in many domains.

History

Technologically, foundation models are built using established machine learning techniques like deep neural networks, transfer learning, and self-supervised learning. Foundation models differ from previous techniques in that they are general-purpose models functioning as reusable infrastructure, rather than bespoke, one-off task-specific models.
Advances in computer parallelism, new developments in neural network architecture, and the increased use of training data with minimal supervision all contributed to the rise of foundation models. Foundation models began to materialize as the latest wave of deep learning models in the late 2010s. Relative to most prior work on deep learning, these language models demonstrated the potential of training on much larger web-sourced datasets using self-supervised objectives. These approaches, which draw upon earlier works like word2vec and GloVe, deviated from prior supervised approaches that required annotated data.
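A minimal sketch of one such self-supervised objective, next-token prediction, is shown below; the embedding layer stands in for a full transformer, and all sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Sketch of the self-supervised next-token objective: the "labels" are the
# input itself shifted by one position, so no human annotation is required.
vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (4, 32))  # a batch of unlabeled text
hidden = embed(tokens)    # stand-in for a full transformer forward pass
logits = lm_head(hidden)  # per-position scores over the vocabulary

# Predict the token at position t+1 from the representation at position t.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()  # gradients flow into embed and lm_head
```

Because the supervision signal comes entirely from raw text, the same recipe scales to web-sized corpora without manual labeling.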
The 2022 releases of Stable Diffusion and ChatGPT led to foundation models and generative AI entering widespread public discourse. Further, the 2023 releases of LLaMA, Llama 2, and Mistral placed greater emphasis on how foundation models are released, with open foundation models garnering both support and scrutiny.

Related concepts

Frontier models

Certain highly advanced foundation models are termed "frontier models", which have the potential to "possess dangerous capabilities sufficient to pose severe risks to public safety." These "dangerous capabilities" stem from the accidental or intentional misuse of such models, which in conjunction with their powerful nature can lead to severe harms. As foundation models continue to improve, some AI researchers speculate that almost all next-generation foundation models will be considered frontier models.
Since the concept of dangerous capabilities is inherently subjective, there is no strict designation for what foundation models qualify as frontier models. However, some generally held ideas for sufficiently dangerous capabilities include:
  • Designing and synthesizing new biological or chemical weapons
  • Producing and propagating convincing, tailored disinformation with minimal user instruction
  • Harnessing unprecedented offensive cyber capabilities
  • Evading human control through deceptive means
Due to frontier models' unique capabilities, it is difficult to effectively regulate their development and deployment. Because of their emergent nature, new dangerous capabilities can appear on their own in frontier models, both in the development stage and after being deployed. Additionally, since frontier models continue to adapt after deployment, it remains difficult to mitigate all harms that arise from already-deployed models. If a frontier model happens to be open-source or is released online, the model can also disseminate rapidly, further hampering regulators by creating a lack of accountability.

General-purpose AI

Due to their adaptability to a wide range of use cases, foundation models are sometimes considered examples of general-purpose AI. In designing the EU AI Act, the European Parliament has stated that a new wave of general-purpose AI technologies shapes the overall AI ecosystem. The fuller structure of the ecosystem, in addition to the properties of specific general-purpose AI systems, influences the design of AI policy and research. General-purpose AI systems also often appear in people's everyday lives through applications and tools like ChatGPT or DALL-E.
Government agencies like the European Parliament have identified the regulation of general-purpose AI, such as foundation models, as a high priority. General-purpose AI systems are often characterized by large size, opacity, and potential for emergence, all of which can create unintended harms. Such systems also heavily influence downstream applications, which further exacerbates the need for regulation. With regard to prominent legislation, a number of stakeholders have pushed for the EU AI Act to include restrictions on general-purpose AI systems, all of which would also apply to foundation models.

World models

World models are sometimes described as foundation models. A world model is a representation of an environment, intended to predict the state of that environment after a set of actions is taken and to implicitly model physical concepts such as gravity. Input prompts for world models can include text or images, as well as videos or 3D scenes, and the resulting 3D environments can be exported. World models, alongside embodied AI, multi-agent models, and neuroscience models of the brain, are seen as alternatives to large language models for achieving artificial general intelligence.
World models do not have a fully agreed-upon definition, but they have been divided into two scopes: one for representing and understanding the current environment, and another for predicting the future state of that environment. In the former view, world models are developed using model-based reinforcement learning and a Markov decision process, with model predictive control or Monte Carlo tree search used to create policies, as in the sketch below. In the latter, large language models or video generation models can be used. In addition, these environments can serve as immersive simulations for training AI agents that can interact in the real world.
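The following sketch illustrates, under simplifying assumptions, how a learned world model can support model predictive control: candidate action sequences are rolled out through the model's predicted dynamics, and the first action of the best imagined trajectory is executed. The dynamics and reward networks here are untrained stand-ins, not any published model.

```python
import torch

# Untrained stand-ins for a learned world model: a dynamics network that
# predicts the next state and a reward network that scores states.
state_dim, action_dim, horizon, n_candidates = 8, 2, 5, 64
dynamics = torch.nn.Linear(state_dim + action_dim, state_dim)
reward = torch.nn.Linear(state_dim, 1)

def plan(state):
    """Model predictive control: imagine rollouts, keep the best first action."""
    s = state.expand(n_candidates, state_dim)  # one copy of the state per candidate
    actions = torch.randn(n_candidates, horizon, action_dim)  # random candidate plans
    total = torch.zeros(n_candidates)
    for t in range(horizon):
        s = dynamics(torch.cat([s, actions[:, t]], dim=-1))  # imagined next state
        total += reward(s).squeeze(-1)                       # imagined reward
    return actions[total.argmax(), 0]  # first action of the best trajectory

action = plan(torch.zeros(1, state_dim))
```

Because every rollout happens inside the model's own predictions, no real-world interaction is needed during planning, which is the core appeal of this scope of world model.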

History

Quanta Magazine traced world models back to a 1943 publication by Kenneth Craik on mental models and the blocks world of SHRDLU in the 1960s. Business Insider traced world models to a 1971 paper by Jay Wright Forrester. A related idea of organizing world knowledge, the frame representation, was proposed by Marvin Minsky in 1974.
In 2018, researchers David Ha and Jürgen Schmidhuber defined world models in the context of reinforcement learning: an agent with a variational autoencoder model V for representing visual observations, a recurrent neural network model M for representing memory, and a linear model C for making decisions. They suggested that agents trained on world models in environments that simulate reality could be applied to real world settings.
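A PyTorch sketch of this V-M-C decomposition is shown below for illustration; all component sizes are arbitrary, and in Ha and Schmidhuber's work the three parts are trained separately on collected rollouts rather than assembled in one pass like this.

```python
import torch
import torch.nn as nn

# Illustrative V-M-C decomposition; sizes are arbitrary stand-ins.
z_dim, h_dim, action_dim = 32, 256, 3

V_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2 * z_dim))  # VAE encoder
M = nn.LSTM(z_dim + action_dim, h_dim, batch_first=True)                    # memory RNN
C = nn.Linear(z_dim + h_dim, action_dim)                                    # linear controller

frame = torch.rand(1, 3, 64, 64)                  # one visual observation
mean, logvar = V_encoder(frame).chunk(2, dim=-1)  # VAE posterior parameters
z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # sampled latent code

prev_action = torch.zeros(1, action_dim)
h0 = torch.zeros(1, 1, h_dim)
c0 = torch.zeros(1, 1, h_dim)
_, (h, _) = M(torch.cat([z, prev_action], dim=-1).unsqueeze(1), (h0, c0))

action = torch.tanh(C(torch.cat([z, h[0]], dim=-1)))  # decision from latent + memory
```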
In 2022, Yann LeCun described a world model as part of a larger cognitive architecture, alongside other neural networks analogous to different regions of the brain. In his view, this framework could lead to commonsense reasoning. LeCun has estimated that world models would be fully functional sometime between the late 2020s and the mid-2030s.