Two trends are bringing about the end of "New Modelitis": the tendency of researchers and practitioners to focus on new model architectures that yield mostly marginal gains, rather than on other, potentially more impactful, parts of the machine learning pipeline, such as data quality and evaluation.
Model-building platforms like Ludwig and Overton enforce commoditized architectures and move toward ML systems that can be specified declaratively (Molino and Ré, 2021). These platforms show that commodity models can perform even better than their hand-tuned predecessors!
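To make the declarative idea concrete, here is a minimal sketch of what such a specification can look like. The config shape loosely mirrors Ludwig-style input/output feature declarations, but the field names and the `build_model` helper are illustrative assumptions rather than the actual Ludwig API.

```python
# A declarative model specification: the user describes *what* the data looks
# like and what should be predicted; the platform decides *how* to model it.
# The config shape loosely mirrors Ludwig-style declarations; `build_model`
# is a hypothetical helper standing in for the platform's compiler.

config = {
    "input_features": [
        {"name": "review_text", "type": "text"},     # handled by a commodity text encoder
        {"name": "product_image", "type": "image"},  # handled by a commodity vision encoder
    ],
    "output_features": [
        {"name": "rating", "type": "category"},      # a standard classification head
    ],
}

def build_model(config: dict):
    """Hypothetical platform entry point: maps each declared feature type to a
    pre-tuned, commoditized encoder/decoder and wires them together."""
    encoders = {f["name"]: f["type"] for f in config["input_features"]}
    decoders = {f["name"]: f["type"] for f in config["output_features"]}
    return encoders, decoders  # a real platform would return a trainable model

model = build_model(config)
```

The point is not the particular helper but the interface: the user never picks an architecture, so the platform is free to swap in whatever commodity model currently works best.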
With the ability to train models on unlabeled data, research is scaling up both data and model size at an impressive rate. With access to such massive amounts of data, the question has shifted from “how do we construct the best model?” to “how do we feed these models?”.
Both trends are supported by results from Kaplan et al., who show that the architecture matters less, and the real lift comes from the data.
- Over the last few years, the natural language processing community has landed on the Transformer (explained very well in this blog) as its commoditized architecture, while the vision community landed on the convolutional neural network (explained in this blog).
- Ramachandran et al. showed that the CNN and the self-attention block in Transformers can in fact be equivalent, and this was capitalized on by Vision Transformers, which use a Transformer to train image classification models that achieve near or above state-of-the-art results.
- These architectures are still complex and expensive to use, and researchers missed the good old days of plain MLP layers. Exciting recent work shows that even the Transformer can be replaced by a stack of MLPs, as in gMLP and MLP-Mixer; a minimal sketch of a Mixer-style block follows this list.
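To give a feel for how simple these MLP-only blocks are, here is a minimal PyTorch sketch of a Mixer-style block: attention is replaced by one MLP that mixes across tokens (patches) and one that mixes across channels. The structure follows the MLP-Mixer design; the hyperparameters below are illustrative, not the values from the paper.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Mixer-style block: two plain MLPs instead of self-attention."""

    def __init__(self, num_tokens: int, dim: int, token_hidden: int, channel_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP: applied across the token (patch) dimension.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP: applied independently to each token.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim)
        )

    def forward(self, x):                        # x: (batch, num_tokens, dim)
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

# Example: 196 patch tokens (a 14x14 grid) with 512 channels each.
block = MixerBlock(num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048)
out = block(torch.randn(2, 196, 512))            # -> (2, 196, 512)
```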
As the goal is to feed these commoditized models as much knowledge as possible, recent work has explored multi-modal applications that use both vision and text data.
- Wu Dao 2.0 is a Chinese 1.75-trillion-parameter mixture-of-experts (MoE) model with multimodal capabilities.
- DALL-E and CLIP are two other multi-modal models from OpenAI that connect text and images; a sketch of CLIP-style contrastive scoring follows this list.
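As a rough illustration of how models like CLIP tie the two modalities together, here is a minimal PyTorch sketch of the contrastive objective: matched image-text pairs are scored higher than mismatched ones. The random embeddings stand in for the outputs of real (here hypothetical) image and text encoders, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

batch = 8
# Stand-ins for encoder outputs, projected into a shared space and L2-normalized.
img_emb = F.normalize(torch.randn(batch, 512), dim=-1)
txt_emb = F.normalize(torch.randn(batch, 512), dim=-1)

temperature = 0.07
logits = img_emb @ txt_emb.t() / temperature          # pairwise image-text similarities

# The i-th image matches the i-th caption, so the targets are the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +            # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2     # text -> image direction
```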
Other groups are trying to curate better pretraining datasets.
- The Pile is a massive new dataset for training language models that is more diverse than the standard Common Crawl.
- Hugging Face's BigScience is a new effort to establish good practices in data curation.
More focus is being put on the kinds of tokenization strategies that can be used to further unify these models. While language tasks typically use WordPiece or Byte-Pair Encoding (BPE) tokens, recent work explores byte-to-byte strategies that operate on individual bytes or characters and require no tokenizer at all. In vision, tokens are usually patches of the image; a short illustration of both ideas follows.
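Here is a small, self-contained illustration of the contrast: byte-level "tokenization" needs no learned vocabulary, while vision models typically carve an image into fixed-size patches and treat each patch as a token. The image shape and patch size below are illustrative assumptions; real pipelines add a learned projection and position information on top.

```python
import numpy as np

text = "Feeding models data"

# Byte-level "tokenization": each UTF-8 byte is directly a token id in [0, 255],
# so there is no vocabulary to learn or maintain.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens[:8])          # [70, 101, 101, 100, 105, 110, 103, 32]

# Patch "tokenization" for vision: split an image into non-overlapping patches.
image = np.zeros((224, 224, 3), dtype=np.float32)   # illustrative (H, W, C) image
patch = 16
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)            # (196, 768): 196 patch tokens, each a flat vector
```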
- Limitations of Autoregressive Models and Their Alternatives explores the theoretical limitations of autoregressive language models, in particular their inability to represent "hard" language distributions.
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand? examines whether models trained only on ungrounded text can, even in principle, acquire meaning.
- Companies like OpenAI, Anthropic, and Cohere see building universal models as part of their core business strategy.
- Many companies are emerging that build applications on top of APIs from these universal-model providers, e.g., AI Dungeon. OpenAI maintains a long list at this link.