We, OLM Research, will fine-tune the OpenLM model with supervised instruction tuning followed by reinforcement learning from human feedback (RLHF).
This will unlock the OpenLM model's potential for use cases such as building chatbots and rating models for AI applications.
To fund the fine-tuning, training, and data collection process, we propose allocating 1% of the total supply of OLM as initial funding for this project.
To support imperative user requests and a chat interface, language models often undergo an instruction-tuning step, which involves training on supervised input/output pairs. We then use reinforcement learning from human feedback to fine-tune our model on a broad class of paired preference data.
To bootstrap, we start the SFT stage with publicly available instruction-tuning data. Existing studies have shown that increasing the diversity of instructions can effectively improve performance. We create two mixtures of datasets: a Human data mixture, which comprises the best human-authored datasets, including FLAN V2, CoT, Dolly, and Open Assistant 1 (we exclude SuperNI because FLAN V2 includes most tasks in SuperNI); and a Human+GPT data mixture, which comprises the human mixture and three additional datasets that contain generations by OpenAI GPT models: GPT4-Alpaca, Code-Alpaca, and ShareGPT.
For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of $2 \times 10^{-5}$, a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. Each sample consists of a prompt and an answer, separated by a special token. We use an autoregressive objective and zero out the loss on tokens from the user prompt, so we backpropagate only on answer tokens: during training, we compute loss only on tokens after <|assistant|> and before the next <|user|> token. We fine-tune the model for 2 epochs.
More formally, we consider an instruction dataset as consisting of $N$ tuples, each with $i$ turns, $\{(x^j_1, y^j_1, x^j_2, y^j_2, \ldots, x^j_i, y^j_i)\}_{j=1}^{N}$, where $x_i$ is a user prompt and $y_i$ the desired output. For most instances, $i = 1$, and we train the model to output $y^j$ given $x^j$. However, in the case of conversation datasets, we train the model to predict $y^j_i$ given the conversation history $x^j_1, y^j_1, x^j_2, \ldots, x^j_i$. We train decoder-only models with teacher forcing and loss masking, where we mask all tokens belonging to the input sequence(s) $x_i$. With $X$ denoting the tokens belonging to the input and $Y$ the target tokens, the loss function is given in Figure 2.
Figure 1: The entire sequence is encoded together, and loss is computed on the assistant parts (colored in blue).
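The masking described above can be illustrated with a small, dependency-free sketch. It is not our training code: the marker ids USER and ASSISTANT, the IGNORE value of -100, and the helper names build_labels and masked_nll are all assumptions made for illustration, and a real tokenizer would assign its own ids for <|user|> and <|assistant|>.

```python
# Hypothetical token ids, for illustration only; a real tokenizer assigns its own.
USER, ASSISTANT = 1, 2
IGNORE = -100  # label value meaning "no loss at this position" (an assumed convention)


def build_labels(turns):
    """Concatenate (role, token_ids) turns and mask everything except assistant tokens."""
    input_ids, labels = [], []
    for role, tokens in turns:
        marker = USER if role == "user" else ASSISTANT
        # The role marker itself is never a training target.
        input_ids.append(marker)
        labels.append(IGNORE)
        input_ids.extend(tokens)
        if role == "assistant":
            labels.extend(tokens)  # loss is computed on answer tokens
        else:
            labels.extend([IGNORE] * len(tokens))  # prompt tokens are masked
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Sum of negative log-likelihoods over unmasked positions only.

    token_log_probs[k] is assumed to be log p(input_ids[k] | input_ids[:k]).
    """
    return -sum(lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE)


# A two-turn conversation: the whole sequence is encoded together,
# but only the assistant spans contribute to the loss.
turns = [
    ("user", [10, 11]),
    ("assistant", [20, 21, 22]),
    ("user", [12]),
    ("assistant", [23]),
]
input_ids, labels = build_labels(turns)
```

In this toy example, four of the eleven positions (the assistant tokens) carry a label; every other position is ignored when the loss is summed.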
Figure 2: The instruction-tuning loss.
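The loss itself is not reproduced in the text. Under the definitions given earlier, a masked autoregressive loss consistent with them can be written as below. The symbols $t_j$ (the $j$-th token of the concatenated sequence, drawn from either $X$ or $Y$) and $\theta$ (the model parameters) are notation assumed here for illustration.

```latex
L_\theta = -\sum_{j} \log p_\theta\left(t_j \mid t_{<j}\right) \cdot \mathbb{1}\left[t_j \in Y\right]
```

The indicator term is 1 when $t_j$ is a target token and 0 otherwise, so input tokens condition the prediction but contribute nothing to the gradient.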
RLHF is a model training procedure applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically sampled human preferences, whereby human annotators select which of several model outputs they prefer.
We explored RLHF fine-tuning with two main algorithms: Rejection Sampling fine-tuning and Direct Preference Optimization (DPO).
We apply RLHF to our model iteratively. For the first four rounds of training, we used only Rejection Sampling fine-tuning; after that, we combined the two sequentially, applying DPO on top of the resulting Rejection Sampling checkpoint before sampling again.
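As a rough illustration of the two algorithms and the round schedule, the sketch below shows a best-of-K selection step for Rejection Sampling, the standard DPO loss for a single preference pair, and which stages run in a given round. The function names, the reward function, and beta = 0.1 are assumptions for illustration, not our actual settings, and the reading that rounds 1 through 4 inclusive use Rejection Sampling only is also an assumption.

```python
import math


def best_of_k(candidates, reward_fn):
    """Rejection Sampling step: keep the highest-reward candidate for one prompt."""
    return max(candidates, key=reward_fn)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta = 0.1 is an assumed value, not a reported hyperparameter.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def stages_for_round(round_index, rs_only_rounds=4):
    """Stages run in a given round (1-indexed), assuming rounds 1-4 are RS only."""
    if round_index <= rs_only_rounds:
        return ["rejection_sampling"]
    # Later rounds: DPO is applied on top of the Rejection Sampling checkpoint.
    return ["rejection_sampling", "dpo"]


# Toy usage with a stand-in reward (string length); a real reward model would go here.
picked = best_of_k(["ok", "a longer answer", "mid"], reward_fn=len)
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy favors the chosen response more strongly than the reference does.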
Training details: We train for one epoch over the training data with Rejection Sampling fine-tuning and two epochs over the pair data with DPO. We use the same optimizer parameters as for the base model. The maximum learning rate is $1 \times 10^{-5}$ for the 7B model. The learning rate is decreased on a cosine schedule, down to 10% of the maximum learning rate. We use a warm-up of 3% of the total number of steps, with a minimum of 5 steps. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.
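The schedule above can be sketched as a single function of the step index. The function name and the linear shape of the warm-up are assumptions; the text specifies only the warm-up length, the peak rate, the cosine decay, and the 10% floor.

```python
import math


def learning_rate(step, total_steps, max_lr=1e-5, warmup_frac=0.03,
                  min_warmup_steps=5, final_ratio=0.1):
    """Learning rate at a 0-indexed step: warm-up, then cosine decay to a floor."""
    # Warm-up lasts 3% of all steps, but never fewer than 5 steps.
    warmup_steps = max(min_warmup_steps, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warm-up (the warm-up shape is an assumption).
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    min_lr = final_ratio * max_lr  # floor at 10% of the maximum
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For a short run of 100 steps, 3% would be only 3 steps, so the 5-step minimum applies; the rate then reaches the maximum of 1e-5 at the end of warm-up and decays toward 1e-6 by the final step.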
The evaluation of the models will be available soon, demonstrating the enhanced performance and alignment with user preferences achieved through our fine-tuning methodologies.
We appreciate the community's support in advancing the capabilities and responsiveness of the OpenLM model.