R&D of OLM Instruction Fine-Tuning Models

Voting ended over 1 year ago. Status: Succeeded

Overview

We, OLM Research, will fine-tune the OpenLM model with supervised instruction tuning followed by reinforcement learning from human feedback (RLHF).

This will unlock the OpenLM model's potential for use cases such as building chatbots and rating models for AI applications.

To fund the fine-tuning, training, and data collection process, we propose allocating 1% of the total supply of OLM as initial funding for this project.

Technical Details

Introduction

To support imperative user requests and a chat interface, language models often undergo an instruction-tuning step, which involves training on supervised input/output pairs. We then use reinforcement learning from human feedback to further fine-tune our model on a broad class of preference-pair data.

To bootstrap, we started the SFT stage with publicly available instruction-tuning data. Existing studies have shown that increasing the diversity of instructions can effectively improve performance. We create two mixtures of datasets:

Human data mixture, which comprises the best human-authored datasets, including FLAN V2, CoT, Dolly, and Open Assistant 1 (we exclude SuperNI, as FLAN V2 includes most tasks in SuperNI).
Human+GPT data mixture, which comprises the human mixture and three additional datasets with generations from OpenAI GPT models: GPT4-Alpaca, Code-Alpaca, and ShareGPT.

Fine-tuning Details

For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of $2 \times 10^{-5}$, a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. We fine-tune the model for 2 epochs.

Each sample consists of a prompt and an answer, separated by a special token. We use an autoregressive objective and zero out the loss on tokens from the user prompt, so we backpropagate only on answer tokens: during training, the loss is computed only on tokens after <|assistant|> and before the next <|user|> token.

More formally, we consider an instruction dataset as consisting of $N$ tuples, each with $i$ turns, $\{(x^j_1, y^j_1, x^j_2, y^j_2, \ldots, x^j_i, y^j_i)\}_{j=1}^{N}$, where $x_i$ is a user prompt and $y_i$ the desired output. For most instances, $i = 1$, and we train the model to output $y^j$ given $x^j$. However, in the case of conversation datasets, we train the model to predict $y^j_i$ given the conversation history $x^j_1, y^j_1, x^j_2, \ldots, x^j_i$. We train decoder-only models and use teacher forcing with loss masking, where we mask all tokens belonging to the input sequence(s) $x_i$. With $X$ denoting the tokens belonging to the input and $Y$ the target tokens, the loss function is the one shown in Figure 2.

Figure 1: The entire sequence is encoded together, and loss is computed on the assistant parts (colored in blue).
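The masking that Figure 1 describes can be sketched in a few lines. This is a minimal illustration, not the project's training code: the helper `build_labels`, the toy token ids, and the use of `-100` as the ignored label (the default `ignore_index` of PyTorch's cross-entropy loss) are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumption: -100 marks positions excluded from the loss, matching the
# default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def build_labels(turns: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Encode a whole conversation as one sequence and mask non-assistant tokens.

    `turns` is a list of (role, token_ids) pairs with role "user" or "assistant".
    Returns (input_ids, labels), two lists of equal length.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Assistant tokens keep their ids, so they contribute to the loss.
            labels.extend(token_ids)
        else:
            # User tokens are still fed to the model but carry no loss.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy two-turn conversation; the token ids are arbitrary placeholders.
conversation = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_labels(conversation)
print(labels)
```

In a real trainer the labels would additionally be shifted by one position for next-token prediction, and the role marker tokens would need a masking decision of their own; both are left out here to keep the sketch short.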

Figure 2: The loss of instruction tuning.
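The figure itself is not reproduced in this text. Under the definitions given (input tokens $X$, target tokens $Y$, teacher forcing with loss masking), the loss would plausibly take the standard form of a masked negative log-likelihood. The symbols $t_j$ (the $j$-th token of the full sequence) and $\theta$ (the model parameters) are notation assumed here, not taken from the original:

```latex
L_\theta = -\sum_{j} \log p_\theta\left(t_j \mid t_{<j}\right) \cdot
\begin{cases}
1 & \text{if } t_j \in Y \\
0 & \text{otherwise}
\end{cases}
```

Every token conditions the prediction of later tokens, but only tokens in $Y$ add a term to the sum.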

RLHF is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically sampled human preferences, whereby human annotators select which of several model outputs they prefer.

We explored RLHF fine-tuning with two main algorithms:

Direct Preference Optimization (DPO) (Rafailov et al., 2023). DPO is reported to exceed PPO-based RLHF in its ability to control the sentiment of generations, and to match or improve response quality in multi-turn dialogue, while being substantially simpler to implement and train.
Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate according to our reward. The same re-ranking strategy for LLMs has also been proposed elsewhere, with the reward seen as an energy function. Here, we go one step further and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward is the one selected.
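The selection step of Rejection Sampling fine-tuning can be sketched as a best-of-K search. The function `best_of_k` and the toy sampler and reward below are hypothetical stand-ins for the real model and reward; only the "sample K, keep the highest-reward output" logic reflects the description above.

```python
from typing import Callable, List, Tuple


def best_of_k(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    k: int,
) -> Tuple[str, float]:
    """Sample k responses for a prompt and return the highest-reward one."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(k)]
    rewards: List[float] = [reward_fn(prompt, c) for c in candidates]
    best_idx = max(range(k), key=lambda i: rewards[i])
    return candidates[best_idx], rewards[best_idx]


# Hypothetical stand-ins: a sampler that replays canned responses and a
# reward that simply scores response length.
_canned = iter(["ok", "a longer answer", "mid answer"])


def toy_sample(prompt: str) -> str:
    return next(_canned)


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


prompt = "Explain RLHF."
best, score = best_of_k(prompt, toy_sample, toy_reward, k=3)

# The selected output becomes a supervised target for the gradient update.
sft_pairs = [(prompt, best)]
print(sft_pairs, score)
```

The resulting `(prompt, best)` pairs can then be trained on with the same masked objective used in the SFT stage.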

We apply RLHF to our model iteratively. For the first four rounds of training, we used only Rejection Sampling fine-tuning; after that, we combined the two sequentially, applying DPO on top of the resulting Rejection Sampling checkpoint before sampling again.
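For reference, the per-pair objective that the DPO step optimizes can be written down compactly. The sketch below follows the loss as published by Rafailov et al. (2023); the function name `dpo_loss`, the scalar log-probability inputs, and the value `beta=0.1` are assumptions for illustration, since the proposal does not state its DPO hyperparameters.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: not specified in the proposal
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))])
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(a)) equals softplus(-a); computed in a numerically stable way.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The reference model here would be the Rejection Sampling checkpoint that DPO starts from, which keeps the policy from drifting far from it.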

Training Details: We train for one epoch over the training data with Rejection Sampling fine-tuning and two epochs over the preference-pair data with DPO. We use the same optimizer parameters as for the base model. The maximum learning rate is $1 \times 10^{-5}$ for the 7B model. The learning rate is decreased on a cosine learning rate schedule, down to 10% of the maximum learning rate. We use a warm-up of 3% of the total number of steps, with a minimum of 5. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.
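The schedule described above can be sketched as a single function of the step index. The numbers (maximum rate, 3% warm-up with a floor of 5 steps, decay to 10% of the maximum) come from the text; the function name `lr_at_step` and the linear shape of the warm-up are assumptions, since the text does not say how the rate ramps up.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    max_lr: float = 1e-5,
    warmup_frac: float = 0.03,
    min_warmup_steps: int = 5,
    final_lr_frac: float = 0.10,
) -> float:
    """Learning rate at a given step: warm-up, then cosine decay to a floor."""
    warmup_steps = max(min_warmup_steps, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumption: linear warm-up that reaches max_lr on the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    min_lr = final_lr_frac * max_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# With 100 total steps, 3% would be 3 steps, so the 5-step minimum applies.
for s in (0, 4, 50, 100):
    print(s, lr_at_step(s, total_steps=100))
```

With 100 total steps the rate peaks at step 4 and settles at one tenth of the maximum by the final step.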

Conclusion

The evaluation of the models will be available soon, demonstrating the enhanced performance and alignment with user preferences achieved through our fine-tuning methodologies.

We appreciate the community's support in advancing the capabilities and responsiveness of OLM models.

Off-Chain Vote

For: 6.17M OLM (100%)

Against: 0 OLM (0%)

Abstain: 0 OLM (0%)


Timeline

Jun 06, 2024: Proposal created

Jun 06, 2024: Proposal vote started

Jun 13, 2024: Proposal vote ended

Dec 06, 2024: Proposal updated