
How Open is Generative AI? Part 2

Embarking on the second and final part of our ‘Generative AI Openness’ series: in the first part, we established a straightforward framework to gauge the openness of Large Language Models (LLMs) and used it to explore LLM development and the positioning of key players. We observed a trend towards increasingly restricted LLM artifacts from OpenAI and Google, contrasted with Meta’s more open approach.

LLM Training Steps

Now, let’s venture into the realm of collaboration and reuse, keeping our openness matrix in mind, to uncover the multifaceted nature of openness in LLMs.

Before we delve into the specifics of different models, their development, and openness, let’s start by considering the openness of some well-known LLMs from a broad perspective.

LLM Model Openness in a Nutshell

| LLM | Model (weights) | Pre-training Dataset | Fine-tuning Dataset | Reward Model | Data Processing Code |
|---|---|---|---|---|---|
| Alpaca | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
| Vicuna | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
| GPT-J, GPT-Neo | 4 - Completely open | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
| Falcon | 3 - Open with limitations | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No code available |
| BLOOM | 3 - Open with limitations | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
| OpenLLaMa | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No complete data processing code available |
| MistralAI | 4 - Access and reuse without restriction | 0 - No public information or access | Not applicable | Not applicable | 4 - Complete data processing code available |
| Dolly | 4 - Access and reuse without restriction | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 4 - Access and reuse possible |
| BLOOMChat | 3 - Open with limitations | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 3 - Open with limitations |
| Zephyr | 4 - Access and reuse without restriction | 3 - Open with limitations | 3 - Open with limitations | 3 - Open with paper and code examples | 3 - Open with limitations |
| AmberChat | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | 2 - Research use only | 0 - No public information available | 4 - Under Apache 2 license |

We often come across news of a new ‘open-source’ LLM being released. Upon closer examination, accessing the model weights or using the model without restrictions is generally feasible, but reproducing the work is often challenging because the training datasets or the data processing code are unavailable. The table also shows that many models skip fine-tuning with a reward model, a step that is crucial to the success of current LLMs and plays a significant role in reducing hallucination and toxicity. Even for the most open models, either no reward model is used or the reward model is not publicly accessible.

In the following sections, we will provide details of the LLM models mentioned in the table above, introduce their evolution, and explain their openness score.

1. Fine-tuned Models from Llama 2

Concluding the first part of this series, we highlighted two fine-tuned models based on Llama 2, subject to Meta’s licensing constraints. Let’s evaluate their openness level.

LLM Training Steps

Alpaca is an instruction-oriented LLM derived from LLaMA, enhanced by Stanford researchers with a dataset of 52,000 instruction-following examples generated from OpenAI’s InstructGPT through the self-instruct method. The self-instruct dataset, the details of data generation, and the model refinement code were all publicly disclosed. The model complies with the licensing requirements of its base model. Because InstructGPT was used for data generation, it also adheres to OpenAI’s usage terms, which prohibit the creation of models competing with OpenAI. This illustrates how dataset restrictions can indirectly affect the resulting fine-tuned model.
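
To make the fine-tuning data concrete, here is a minimal sketch of a self-instruct-style record and of how such a record is typically rendered into a prompt/response pair. The field names follow the published Alpaca data format; the example content and the simplified template are illustrative, not taken from the actual 52K dataset.

```python
# One record in the Alpaca instruction/input/output format
# (illustrative content, not an actual entry from the 52K dataset).
alpaca_example = {
    "instruction": "Classify the sentiment of the following review.",
    "input": "The battery life is great, but the screen scratches easily.",
    "output": "Mixed: positive about battery life, negative about durability.",
}

# During fine-tuning, each record is rendered into a prompt/response pair,
# roughly along these lines (template simplified for illustration).
prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n\n"
    f"### Instruction:\n{alpaca_example['instruction']}\n\n"
    f"### Input:\n{alpaca_example['input']}\n\n"
    "### Response:\n"
)
target = alpaca_example["output"]
print(prompt + target)
```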

Vicuna is another instruction-focused LLM rooted in LLaMA, developed by researchers from UC Berkeley, Carnegie Mellon University, Stanford, and UC San Diego. They adapted Alpaca’s training code and incorporated 70,000 examples from ShareGPT, a platform for sharing ChatGPT interactions.

Alpaca and Vicuna Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 3 | Open with limitations | Both Vicuna and Alpaca are based on the Llama 2 foundational model. |
| Pre-training Dataset | 1 | Published research only | Both Vicuna and Alpaca are derived from the Llama 2 foundational model and dataset. |
| Fine-tuning Dataset | 2 | Research use only | Both models are constrained by OpenAI’s non-competition clause due to their training with data originating from ChatGPT. |
| Reward model | NA | Not applicable | Neither model underwent Reinforcement Learning from Human Feedback (RLHF) initially, hence there are no reward models to evaluate. It’s worth noting that AlpacaFarm, a framework simulating an RLHF process, was released under a non-commercial license, and StableVicuna underwent RLHF fine-tuning on Vicuna. |
| Data Processing Code | 4 | Under Apache 2 license | Both projects have shared their code on GitHub (Vicuna, Alpaca). |

Significantly, both projects face dual constraints: initially from LLaMA’s licensing on the model and subsequently from OpenAI due to their fine-tuning data.

2. Collaboration and Open Source in LLM Evolution

LLM Training Steps

In addition to the Llama foundation model and its associated families of fine-tuned LLMs, many initiatives contribute to promoting the openness of foundational models and their fine-tuned derivatives.

2.1. Foundational Models and Pre-training Datasets

Research highlights the cost-effectiveness of developing instruction-tuned LLMs on top of foundational models through collaboration and reuse. This approach, however, requires genuinely open-source foundational models and pre-training datasets.

2.1.1 EleutherAI

This vision is embodied by EleutherAI, a non-profit organization founded in July 2020 by a group of researchers. Driven by the perceived opacity and poor reproducibility of AI research, their goal was to create leading open-source language models.

LLM Training Steps

By December 2020, EleutherAI had introduced The Pile, a comprehensive text dataset designed for training models. Tech giants such as Microsoft, Meta, and Google subsequently used this dataset to train their models. In March 2021, EleutherAI revealed GPT-Neo, an open-source model under the Apache 2.0 license, unmatched in size among open-source models at its launch. Later projects include the release of GPT-J, a 6 billion parameter model, and GPT-NeoX, a 20 billion parameter model, unveiled in February 2022. Their work demonstrates the viability of high-quality open-source AI models.

EleutherAI GPT-J Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Completely open | GPT-J’s model weights are freely accessible, in line with EleutherAI’s commitment to open-source AI. EleutherAI GitHub |
| Pre-training Dataset | 3 | Open with limitations | GPT-J was trained on The Pile, a large-scale dataset curated by EleutherAI. While mostly open, parts of The Pile may have limitations. The Hugging Face page notes: “Licensing Information: Please refer to the specific license depending on the subset you use” |
| Fine-tuning Dataset | NA | Not applicable | GPT-J is a foundational model and wasn’t specifically fine-tuned on additional datasets for its initial release. |
| Reward model | NA | Not applicable | GPT-J did not undergo RLHF, making this category non-applicable. |
| Data Processing Code | 4 | Completely open | The code for data processing and model training for GPT-J is openly available, fostering transparency and community involvement. |

2.1.2 Falcon

In March 2023, a research team from the Technology Innovation Institute (TII) in the United Arab Emirates introduced a new open model lineage named Falcon, along with its dataset.

LLM Training Steps

Falcon comes in two versions: the first with 40 billion parameters trained on one trillion tokens, and the second with 180 billion parameters trained on 3.5 trillion tokens. The latter is said to rival the performance of models like LLaMA 2, PaLM 2, and GPT-4. TII emphasizes the quality of Falcon’s training data, predominantly sourced from public web crawls (~80%), academic papers, legal documents, news outlets, literature, and social media dialogues. Its license, based on the open-source Apache License, allows entities to innovate and commercialize using Falcon 180B, including hosting it on proprietary or leased infrastructure. However, it explicitly prohibits hosting providers from exploiting direct access to shared Falcon 180B instances and its refinements, notably through API access. Because of this clause, the license does not fully align with OSI’s open-source criteria.
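
Because RefinedWeb is published on the Hugging Face Hub under ODC-By, it can be inspected directly. Here is a minimal sketch using the datasets library in streaming mode; the repository id tiiuae/falcon-refinedweb and the content column follow the public dataset card and should be verified against it, as the schema may change.

```python
from datasets import load_dataset

# Stream a few documents from RefinedWeb without downloading the full corpus.
# Repository id and column name follow the public dataset card; treat them as
# assumptions and check the card if the schema has changed.
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["content"][:200])  # first 200 characters of each web document
    if i == 2:
        break
```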

Falcon Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 3 | Open with limitations | Falcon’s license is inspired by Apache 2 but restricts hosting uses. |
| Pre-training Dataset | 4 | Access and reuse without restriction | The RefinedWeb dataset is distributed under the Open Data Commons Attribution License (ODC-By) and also under the CommonCrawl terms, which are quite open. |
| Fine-tuning Dataset | NA | Not applicable | Falcon is a foundational model and can be fine-tuned on various specific datasets as per use case, not provided by the original creators. |
| Reward model | NA | Not applicable | Falcon did not undergo RLHF in its initial training, hence no reward model for evaluation. |
| Data Processing Code | 1 | No code available | General instructions are available here. |

2.1.3 LAION

LAION and BLOOM highlight the international scope of this field.

LLM Training Steps

LAION (Large-scale Artificial Intelligence Open Network), a German non-profit established in 2020, is dedicated to advancing open-source models and datasets (primarily under Apache 2 and MIT licenses) to foster open research and the evolution of benevolent AI. Their datasets, encompassing both images and text, have been pivotal in the training of renowned text-to-image models like Stable Diffusion.

2.1.4 BLOOM

LLM Training Steps

BLOOM, boasting 176 billion parameters, is capable of generating text in 46 natural and 13 programming languages. It represents the culmination of a year-long collaboration involving over 1,000 researchers from more than 70 countries and 250 institutions, which concluded with a 117-day training run on the French Jean Zay supercomputer. Distributed under an OpenRAIL license, BLOOM is not considered fully open-source due to usage constraints, such as prohibitions against harmful use, discrimination, or providing medical advice and interpreting medical results.

BLOOM Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 3 | Open with limitations | BLOOM’s model weights are publicly accessible, reflecting a commitment to open science, but under an OpenRAIL license. BigScience GitHub |
| Pre-training Dataset | 3 | Open with limitations | BLOOM’s primary pre-training dataset, while officially under an Apache 2 license, is based on various subsets with potential limitations. |
| Fine-tuning Dataset | NA | Not applicable | BLOOM is a foundational model and wasn’t fine-tuned on specific datasets for its initial release. |
| Reward model | NA | Not applicable | BLOOM did not undergo RLHF, hence no reward model for evaluation. |
| Data Processing Code | 4 | Completely open | The data processing and training code for BLOOM are openly available, encouraging transparency and community participation. BigScience GitHub |

2.1.5 OpenLLaMA and RedPajama

RedPajama, initiated in 2022 by Together AI, a non-profit advocating AI democratization, playfully references LLaMA and the children’s book “Llama Llama Red Pajama”.

LLM Training Steps

The initiative has expanded to include partners like Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute. In April 2023, they released a 1.2 trillion token dataset, mirroring LLaMA’s dataset, for training their models. These models, with parameters ranging from 3 to 7 billion, were released in September under the open-source Apache 2 license.

The RedPajama dataset was adapted by the OpenLLaMA project at UC Berkeley, creating an open-source LLaMA equivalent without Meta’s restrictions. The model’s later version also included data from Falcon and StarCoder. This highlights the importance of open-source models and datasets, enabling free repurposing and innovation.

OpenLLaMa Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Access and reuse without restriction | Models and weights (in PyTorch and JAX formats) are available under the Apache 2 open-source license. |
| Pre-training Dataset | 4 | Access and reuse without restriction | Based on the RedPajama, Falcon, and StarCoder datasets. |
| Fine-tuning Dataset | NA | Not applicable | OpenLLaMA is a foundational model. |
| Reward model | NA | Not applicable | OpenLLaMA did not undergo RLHF, hence no reward model for evaluation. |
| Data Processing Code | 1 | No complete data processing code available | The Hugging Face page provides some usage examples with transformers. |

2.1.6 MistralAI

LLM Training Steps

MistralAI, a French startup, developed a 7.3 billion parameter LLM named Mistral for various applications. Although MistralAI is committed to open-sourcing its technology under Apache 2.0, the training dataset details for Mistral remain undisclosed. The Mistral Instruct model was fine-tuned using publicly available instruction datasets from the Hugging Face repository, though specifics about their licenses and potential constraints are not detailed. Recently, MistralAI released Mixtral 8x7B, a model based on a sparse mixture of experts (SMoE) architecture, which combines several specialized expert sub-networks (likely eight, as the name suggests) that are activated as needed.
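
To illustrate the idea behind an SMoE layer, here is a deliberately simplified sketch of token-level top-k expert routing in PyTorch. It is not MistralAI’s implementation; the dimensions, the number of experts, and the routing loop are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Simplified sparse mixture-of-experts layer: a router picks the top-k
    experts per token and mixes their outputs. Illustrative only."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # scores each expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep k best experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]) — only 2 of 8 experts run per token
```

Because only the selected experts are exercised for a given token, an SMoE model keeps its inference cost closer to that of a much smaller dense model while retaining a larger total parameter count.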

Mistral and Mixtral Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Access and reuse without restriction | Models and weights are available under the Apache 2 open-source license. |
| Pre-training Dataset | 0 | No public information or access | The training dataset for the model is not publicly detailed. |
| Fine-tuning Dataset | NA | Not applicable | Mistral is a foundational model. |
| Reward model | NA | Not applicable | Mistral did not undergo RLHF, hence no reward model for evaluation. |
| Data Processing Code | 4 | Complete data processing code available | Instructions and deployment code are available on GitHub. |

These examples illustrate the diverse approaches to openness in foundational models, emphasizing the value of sharing and enabling reuse by others.

2.2. Fine-tuned Models and Instruction Datasets

When pre-trained models and datasets are accessible with minimal restrictions, various entities can repurpose them to develop new foundational models or fine-tuned variants.

2.2.1 Dolly

Drawing inspiration from LLaMA and fine-tuned projects like Alpaca and Vicuna, the big data company Databricks introduced Dolly 1.0 in March 2023.

LLM Training Steps

This cost-effective LLM was built upon EleutherAI’s GPT-J, employing the data and training methods of Alpaca. Databricks highlighted that this model was developed for under $30, suggesting that significant advancements in leading-edge models like ChatGPT might be more attributable to specialized instruction-following training data than to larger or more advanced base models. Two weeks later, Databricks launched Dolly 2.0, still based on an EleutherAI model, but now exclusively fine-tuned with a pristine, human-curated instruction dataset created by Databricks’ staff. They chose to open source the entirety of Dolly 2.0, including the training code, dataset, and model weights, making them suitable for commercial applications. In June 2023, Databricks announced the acquisition of MosaicML, a company that had just released its MPT (MosaicML Pre-trained Transformer) foundational model under the Apache 2.0 license, allowing anyone to train, refine, and deploy their own MPT models.

Dolly Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Access and reuse without restriction | Dolly 2 is based on the EleutherAI foundational model and is released under the MIT license. |
| Pre-training Dataset | 3 | Open with limitations | Dolly 2 is built upon the EleutherAI foundational model and inherits the same limitations as its dataset (The Pile). |
| Fine-tuning Dataset | 4 | Access and reuse without restriction | The fine-tuning dataset was created by Databricks employees and released under the CC-BY-SA license. |
| Reward model | 0 | No public information available | The reward model of Dolly is not publicly disclosed. |
| Data Processing Code | 4 | Access and reuse possible | Code to train and run the model is available on GitHub under the Apache 2 license. |

2.2.2 OpenAssistant

OpenAssistant stands as another example of the potential of open source fine-tuning datasets.

LLM Training Steps

This initiative aims to create an open-source, chat-based assistant proficient in understanding tasks, interacting with third-party systems, and retrieving dynamic information. Led by LAION and international contributors, the project is distinctive in relying on crowdsourced data collected from human volunteers. The OpenAssistant team has run numerous crowdsourcing campaigns to gather data for a variety of tasks, such as generating diverse text formats (from poems and code to emails and letters) and providing informative responses to queries. Recently, the LAION initiative also released an open-source dataset named OIG (Open Instruction Generalist), comprising 43M instructions and focusing on data augmentation rather than human feedback.

2.2.3 BLOOMChat

LLM Training Steps

BLOOMChat, a fine-tuned chat model with 176 billion parameters, demonstrates the reuse of previous models and datasets. It underwent instruction tuning based on the BLOOM foundational model, incorporating datasets like OIG, OpenAssistant, and Dolly.

BLOOMChat Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 3 | Open with limitations | Based on the BLOOM foundational model, it inherits its restrictions (OpenRAIL license). |
| Pre-training Dataset | 3 | Open with limitations | Based on the BLOOM foundational model and dataset, it inherits potential restrictions. |
| Fine-tuning Dataset | 4 | Access and reuse without restriction | The Dolly and LAION fine-tuning datasets are open source. |
| Reward model | 0 | No public information available | The reward model of BLOOMChat is not publicly disclosed. |
| Data Processing Code | 3 | Open with limitations | Code to train and run the model is available on GitHub under an OpenRAIL-type license. |

2.2.4 Zephyr

LLM Training Steps

Zephyr is another exemplary case of reusing an open foundational model. Developed by the Hugging Face H4 project, it is a fine-tuned version of the Mistral 7B model. It was fine-tuned using filtered versions of the UltraChat and UltraFeedback datasets, which were generated using ShareGPT and GPT-4 and may therefore impose transitive use restrictions on the resulting model.

Zephyr Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Access and reuse without restriction | Zephyr is released under the MIT license, and the Mistral foundational model is under Apache 2. |
| Pre-training Dataset | 3 | Open with limitations | Inherits possible restrictions from the Mistral foundational model and dataset. |
| Fine-tuning Dataset | 3 | Open with limitations | ShareGPT and GPT-4 were used to produce the fine-tuning datasets, imposing limitations. |
| Reward model | 3 | Open with paper and code examples | Zephyr was fine-tuned using Direct Preference Optimization (DPO), not RLHF. A paper and some code and examples on this technique are available (a minimal sketch of the DPO loss follows this table). |
| Data Processing Code | 3 | Open with limitations | Example code to train and run the model is available on Hugging Face. |
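
For readers unfamiliar with DPO, the technique referenced in the reward-model row above, it optimizes the policy directly on preference pairs instead of training a separate reward model. Here is a minimal sketch of the per-batch loss, assuming the summed log-probabilities of the chosen and rejected responses under both the fine-tuned policy and a frozen reference model have already been computed; it is not the Zephyr training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.
    Inputs are summed log-probabilities of the chosen/rejected responses under
    the policy being fine-tuned and under the frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with random log-probabilities (batch of 4 preference pairs).
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```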

2.2.5 LLM360

LLM Training Steps

LLM360 is an emerging yet intriguing initiative focused on ensuring complete openness of all essential elements required for training models. This includes not only model weights but also checkpoints, training datasets, and source code for data preprocessing and model training. The project is a collaborative effort between the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE and two American AI companies. It has introduced two foundational models: Amber, a 7 billion parameter English LLM, and CrystalCoder, a 7 billion parameter LLM specializing in code and text. Additionally, there are fine-tuned versions of these models, including AmberChat.

AmberChat Openness

| Component | Score | Level description | Motivation and links |
|---|---|---|---|
| Model (weights) | 4 | Access and reuse without restriction | AmberChat is released under the Apache 2 open-source license. |
| Pre-training Dataset | 4 | Access and reuse without restriction | Based on RedPajama v1, Falcon RefinedWeb, and StarCoderData. |
| Fine-tuning Dataset | 2 | Research use only | The AmberChat fine-tuning dataset is based on WizardLM evol-instruct V2 and ShareGPT-90K, both derived from ShareGPT. |
| Reward model | 0 | No public information available | The reward model is not available. DPO is mentioned as the alignment method for AmberSafe, another fine-tuned Amber model. |
| Data Processing Code | 4 | Under Apache 2 license | Data processing and model training code are available under the Apache 2 open-source license. |

An assessment of the openness of AmberChat reveals that while the Amber foundational model is quite open, providing unrestricted access to weights, dataset, and code, the fine-tuning process in AmberChat introduces certain limitations to this initial level of openness.


These diverse examples highlight how openness enables global collaboration, allowing entities to repurpose earlier models and datasets in ways typical of the research community. This approach clarifies licensing nuances and fosters external contributions. And it’s this spirit of collaborative and community-driven innovation that is championed by a notable player: Hugging Face.

2.3. The Hugging Face Ecosystem

LLM Training Steps

The Hugging Face initiative represents a community-focused endeavor aimed at advancing and democratizing artificial intelligence through open-source and open-science methodologies. Originating from the Franco-American company Hugging Face, Inc., it is renowned for its Transformers library, which offers open-source implementations of transformer models for text, image, and audio tasks. In addition, it provides libraries dedicated to dataset processing, model evaluation, simulations, and machine learning demonstrations.

Hugging Face has spearheaded two significant scientific projects, BigScience and BigCode, leading to the development of two large language models: BLOOM and StarCoder. Moreover, the initiative curates an LLM leaderboard, allowing users to upload and assess models on text generation tasks.

However, in my opinion, the truly revolutionary aspect is the Hugging Face Hub. This online collaborative space acts as a hub where enthusiasts can explore, experiment, collaborate, and create machine learning-focused technology. Hosting thousands of models, datasets, and demo applications, all transparently accessible, it has quickly become the go-to platform for open AI, similar to GitHub in its field. The Hub’s user-friendly interface encourages effortless reuse and integration, cementing its position as a pivotal element in the open AI landscape.
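
A minimal sketch of the kind of reuse the Hub enables: pulling one of the models discussed above with the transformers pipeline API. The repository id HuggingFaceH4/zephyr-7b-beta is one example; any text-generation checkpoint on the Hub can be loaded the same way, hardware permitting (this one weighs in at 7 billion parameters).

```python
from transformers import pipeline

# Any text-generation checkpoint hosted on the Hub can be pulled the same way;
# the repository id below is just one of the models discussed in this article
# (7B parameters, so expect a sizeable download and a GPU or a patient CPU).
generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

print(generator("How open is generative AI?", max_new_tokens=50)[0]["generated_text"])
```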

3. On the Compute Side

3.1 Computing Power

Generative AI extends beyond software, a reality well understood by cloud service providers such as Microsoft (with OpenAI) and Google, which benefit from the ongoing push to improve LLM performance through ever greater computing resources.

In September 2023, AWS announced an investment of up to $4 billion in Anthropic, whose models will be integrated into Amazon Bedrock and developed, trained, and deployed using AWS AI chips.

NVIDIA is at the forefront of expanding computing capacity with its GPUs. This commitment is reflected in their CUDA (Compute Unified Device Architecture) parallel computing platform and the associated programming model. Additionally, NVIDIA manages the NeMo conversational AI toolkit, which serves both research purposes and as an enterprise-grade framework for handling LLMs and performing GPU-based computations, whether in the Cloud or on personal computers.

The AI chip recently unveiled by Intel marks a significant step towards giving users more control over where LLM applications run. Diverging from the cloud providers’ strategies, Intel aims to penetrate the market for AI chips suited to operation outside data centers, enabling the deployment of LLMs on personal desktops, for example through its OpenVINO open-source framework.

This field is characterized by dynamic competition, with companies like AMD and Alibaba making notable inroads into the AI chip market. The development of proprietary AI chips by Cloud Service Providers (CSPs), such as Google’s TPU and AWS’s Inferentia, as well as investments by startups like SambaNova, Cerebras, and Rain AI (backed by Sam Altman), further intensifies this competition.

3.2 Democratizing AI Computing

LLM Training Steps

The llama.cpp project, initiated by Georgi Gerganov, exemplifies efforts to democratize AI and drive innovation by promoting CPU-compatible models. It is a C/C++ implementation of LLaMA inference, originally designed for use on MacBooks and now also supporting x86 architectures. Employing quantization, which reduces LLMs’ size and computational demands by converting model weights from floating-point to integer values, the project created the GGML binary format for efficient model storage and loading. GGML has since been succeeded by the more versatile and extensible GGUF format.
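
To make the quantization idea concrete, here is a toy illustration of symmetric 8-bit quantization of a small weight block. It shows the general principle only; the actual GGML/GGUF formats use more elaborate block-wise, lower-bit schemes with per-block scales.

```python
import numpy as np

# Toy symmetric int8 quantization of one block of weights. This illustrates
# the principle only; GGML/GGUF use more elaborate block-wise, lower-bit schemes.
weights = np.random.randn(32).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale per block
q = np.round(weights / scale).astype(np.int8)  # store 1 byte per weight + the scale
dequant = q.astype(np.float32) * scale         # approximate reconstruction

print("max abs error:", np.abs(weights - dequant).max())
print("size: %d bytes -> %d bytes (plus a 4-byte scale)" % (weights.nbytes, q.nbytes))
```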

Since its launch in March 2023, llama.cpp has quickly gained traction among researchers and developers for its scalability and compatibility across macOS, Linux, Windows, and Docker.

LLM Training Steps

The GPT4All project, which builds on llama.cpp, aims to train and deploy LLMs on conventional hardware. The ecosystem democratizes AI by allowing researchers and developers to train and deploy LLMs on their own hardware, circumventing costly cloud computing services, and has made LLMs more accessible and affordable. GPT4All includes a variety of models derived from GPT-J, MPT, and LLaMA, fine-tuned with datasets including ShareGPT conversations; the ShareGPT-related limitations mentioned earlier therefore still apply. The ecosystem also features a chat desktop application, as previously discussed in a related article.
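
GPT4All also ships Python bindings around the same llama.cpp backend. A minimal sketch, assuming the gpt4all package is installed; the model file name below changes between releases and should be replaced with one listed in the application’s model catalog.

```python
from gpt4all import GPT4All

# Model file names change between GPT4All releases; check the application's
# model list and substitute one that is actually available.
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")  # downloads on first use

with model.chat_session():
    print(model.generate("Summarize what quantization does to an LLM.", max_tokens=100))
```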

LLM Training Steps

LocalAI is another open-source project designed to facilitate running LLMs locally. Previously introduced in a dedicated article, LocalAI leverages llama.cpp and supports GGML/GGUF formats and Hugging Face models, among others. This tool simplifies deploying an open model on a standard computer or within a container, integrating it into a larger, distributed architecture. The BionicGPT open-source project exemplifies its usage.
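
Because LocalAI exposes an OpenAI-compatible REST API, existing client code can simply be pointed at the local endpoint. A sketch assuming a LocalAI instance is running on localhost:8080 with a local model configured under the alias my-local-model; both are deployment-specific.

```python
import requests

# LocalAI serves an OpenAI-compatible API; the port and the model alias below
# depend on how the local instance was configured.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "my-local-model",
        "messages": [{"role": "user", "content": "What is GGUF?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```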

Conclusion

The evolution of Generative AI mirrors broader technological advancements, fluctuating between open collaboration and proprietary control. The AI landscape, as illustrated by the rise and diversification of LLMs, stands at a crossroads. On one side, tech giants drive towards commercialization and centralization, leveraging immense AI potential and vast computational resources. On the other, a growing movement champions open AI, emphasizing collaboration, transparency, and democratization.

The LLaMA project, despite its limited openness, has inspired numerous others, as evident in names like Alpaca, Vicuna, Dolly, Goat, Dalai (a play on Dalai Lama), RedPajama, and more. We may soon see other animal-themed projects joining this AI menagerie, perhaps flying alongside Falcon?

The range of projects from LLaMA to Falcon, and platforms like Hugging Face, highlight the vitality of the open AI movement. They remind us that while commercial interests are crucial for innovation, the spirit of shared knowledge and collective growth is equally essential. The “Linux moment” of AI beckons a future where openness entails not just access, but also the freedom to innovate, repurpose, and collaborate.

In our view, the centralization of AI computing power tends to lean towards closed-source solutions, whereas its democratization inherently supports open-source alternatives.

As AI increasingly permeates our lives, the choices we make now, whether to centralize or democratize, to close or open up, will shape not just the future of technology but the essence of our digital society. This moment of reckoning urges us to contemplate the kind of digital world we wish to create and pass on to future generations.