Example
====================================================

Here we provide a very simple script for supervised finetuning, which is revised from the training script in `FastChat `__. The script is used to finetune Qwen with Hugging Face Trainer. You can check the script `here `__.

This script for supervised finetuning (SFT) has the following features:

- Support single-GPU and multi-GPU training;
- Support full-parameter tuning, `LoRA `__, and `Q-LoRA `__.

In the following, we introduce more details about the usage of the script.

Installation
------------

Before you start, make sure you have installed the following packages:

.. code:: bash

   pip install peft deepspeed optimum accelerate

Data Preparation
----------------

For data preparation, we advise you to organize the data in a jsonl file, where each line is a dictionary as demonstrated below:

.. code:: json

   {
       "type": "chatml",
       "messages": [
           {
               "role": "system",
               "content": "You are a helpful assistant."
           },
           {
               "role": "user",
               "content": "Tell me something about large language models."
           },
           {
               "role": "assistant",
               "content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
           }
       ],
       "source": "unknown"
   }

.. code:: json

   {
       "type": "chatml",
       "messages": [
           {
               "role": "system",
               "content": "You are a helpful assistant."
           },
           {
               "role": "user",
               "content": "What is your name?"
           },
           {
               "role": "assistant",
               "content": "My name is Qwen."
           }
       ],
       "source": "self-made"
   }

Above are two examples of data samples in the dataset. Each sample is a JSON object with the following fields: ``type``, ``messages``, and ``source``. ``messages`` is required, while the others are optional labels for your data format and data source. The ``messages`` field is a list of JSON objects, each of which has two fields: ``role`` and ``content``. ``role`` can be ``system``, ``user``, or ``assistant``, and ``content`` is the text of the message. ``source`` is the source of the data, which can be ``self-made``, ``alpaca``, ``open-hermes``, or any other string.

To build the jsonl file, you can use the ``json`` module to write a list of dictionaries to it, one per line:

.. code:: python

   import json

   with open('data.jsonl', 'w') as f:
       for sample in samples:
           f.write(json.dumps(sample) + '\n')
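Conversely, before training you may want to read the file back and sanity-check its structure. Below is a minimal sketch based on the format described above; the file name and the checks are only illustrative:

.. code:: python

   import json

   # Hypothetical sanity check for the jsonl format described above.
   with open("data.jsonl") as f:
       for line_no, line in enumerate(f, start=1):
           sample = json.loads(line)
           assert "messages" in sample, f"line {line_no}: missing 'messages'"
           for message in sample["messages"]:
               assert message["role"] in ("system", "user", "assistant")
               assert isinstance(message["content"], str)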
Quickstart
----------

To start finetuning quickly, we directly provide a shell script that you can run without paying attention to the details. You need different hyperparameters for different types of training, e.g., single-GPU / multi-GPU training, full-parameter tuning, LoRA, or Q-LoRA.

.. code:: bash

   cd examples/sft
   bash finetune.sh -m <model_path> -d <data_path> --deepspeed <config_path> [--use_lora True] [--q_lora True]

Specify the ``<model_path>`` for your model, ``<data_path>`` for your data, and ``<config_path>`` for your deepspeed configuration. If you use LoRA or Q-LoRA, just add ``--use_lora True`` or ``--q_lora True`` based on your requirements. This is the simplest way to start finetuning. If you want to change more hyperparameters, you can dive into the script and modify those parameters.

Advanced Usages
---------------

In this section, we introduce the details of the scripts, including the core python script as well as the corresponding shell script.

Shell Script
~~~~~~~~~~~~

Before we introduce the python code, we provide a brief introduction to the shell script with commands. We provide some guidance inside the shell script, and here we take ``finetune.sh`` as an example.

To set up the environment variables for distributed training (or single-GPU training), specify the following variables: ``GPUS_PER_NODE``, ``NNODES``, ``NODE_RANK``, ``MASTER_ADDR``, and ``MASTER_PORT``. There is no need to worry too much about them, as we provide default settings for you. In the command, you can pass in the arguments ``-m`` and ``-d`` to specify the model path and data path, respectively. You can also pass in the argument ``--deepspeed`` to specify the deepspeed configuration file. We provide two configuration files for ZeRO2 and ZeRO3, and you can choose one based on your requirements. In most cases, we recommend using ZeRO3 for multi-GPU training, except for Q-LoRA, where we recommend using ZeRO2.

There are a series of hyperparameters to tune. Pass in ``--bf16`` or ``--fp16`` to specify the precision for mixed precision training. The other significant hyperparameters include:

- ``--output_dir``: the path of your output models or adapters.
- ``--num_train_epochs``: the number of training epochs.
- ``--gradient_accumulation_steps``: the number of gradient accumulation steps.
- ``--per_device_train_batch_size``: the batch size per GPU for training; the total batch size is equal to ``per_device_train_batch_size`` :math:`\times` ``number_of_gpus`` :math:`\times` ``gradient_accumulation_steps``.
- ``--learning_rate``: the learning rate.
- ``--warmup_steps``: the number of warmup steps.
- ``--lr_scheduler_type``: the type of learning rate scheduler.
- ``--weight_decay``: the value of weight decay.
- ``--adam_beta2``: the value of :math:`\beta_2` in Adam.
- ``--model_max_length``: the maximum sequence length.
- ``--use_lora``: whether to use LoRA. Adding ``--q_lora`` can enable Q-LoRA.
- ``--gradient_checkpointing``: whether to use gradient checkpointing.

Python Script
~~~~~~~~~~~~~

In this script, we mainly use the HF ``Trainer`` and ``peft`` to train our models. We also use ``deepspeed`` to accelerate the training process. The script is very simple and easy to understand.

.. code:: python

   @dataclass
   class ModelArguments:
       model_name_or_path: Optional[str] = field(default="Qwen/Qwen-7B")


   @dataclass
   class DataArguments:
       data_path: str = field(
           default=None, metadata={"help": "Path to the training data."}
       )
       eval_data_path: str = field(
           default=None, metadata={"help": "Path to the evaluation data."}
       )
       lazy_preprocess: bool = False


   @dataclass
   class TrainingArguments(transformers.TrainingArguments):
       cache_dir: Optional[str] = field(default=None)
       optim: str = field(default="adamw_torch")
       model_max_length: int = field(
           default=8192,
           metadata={
               "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
           },
       )
       use_lora: bool = False


   @dataclass
   class LoraArguments:
       lora_r: int = 64
       lora_alpha: int = 16
       lora_dropout: float = 0.05
       lora_target_modules: List[str] = field(
           default_factory=lambda: [
               "q_proj",
               "k_proj",
               "v_proj",
               "o_proj",
               "up_proj",
               "gate_proj",
               "down_proj",
           ]
       )
       lora_weight_path: str = ""
       lora_bias: str = "none"
       q_lora: bool = False

The classes for arguments allow you to specify hyperparameters for the model, data, training, and additionally LoRA if you use LoRA or Q-LoRA to train your model. Specifically, ``model_max_length`` is a key hyperparameter that determines the maximum sequence length of your training data.
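For reference, these dataclasses are not instantiated by hand: later in the script they are turned into command-line flags by ``transformers.HfArgumentParser``. The following is a minimal sketch of that mapping, reusing the argument classes defined above; the flag values are hypothetical, and ``--output_dir`` is included because ``TrainingArguments`` requires it:

.. code:: python

   import transformers

   # Hypothetical command line; each flag name mirrors a dataclass field above.
   argv = [
       "--model_name_or_path", "Qwen/Qwen-7B",
       "--data_path", "data.jsonl",
       "--output_dir", "output_qwen",
       "--model_max_length", "2048",
       "--use_lora", "True",
   ]

   parser = transformers.HfArgumentParser(
       (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
   )
   model_args, data_args, training_args, lora_args = parser.parse_args_into_dataclasses(argv)
   print(training_args.model_max_length, training_args.use_lora)  # 2048 True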
``LoraArguments`` includes the hyperparameters for LoRA or Q-LoRA:

- ``lora_r``: the rank for LoRA;
- ``lora_alpha``: the alpha value for LoRA;
- ``lora_dropout``: the dropout rate for LoRA;
- ``lora_target_modules``: the target modules for LoRA. By default we tune all linear layers;
- ``lora_weight_path``: the path to the weight file for LoRA;
- ``lora_bias``: the bias for LoRA;
- ``q_lora``: whether to use Q-LoRA.

.. code:: python

   def maybe_zero_3(param):
       # Gather a parameter that may be partitioned by DeepSpeed ZeRO-3 before copying it to CPU.
       if hasattr(param, "ds_id"):
           assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE
           with zero.GatheredParameters([param]):
               param = param.data.detach().cpu().clone()
       else:
           param = param.detach().cpu().clone()
       return param


   # Borrowed from peft.utils.get_peft_model_state_dict
   def get_peft_state_maybe_zero_3(named_params, bias):
       if bias == "none":
           to_return = {k: t for k, t in named_params if "lora_" in k}
       elif bias == "all":
           to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
       elif bias == "lora_only":
           to_return = {}
           maybe_lora_bias = {}
           lora_bias_names = set()
           for k, t in named_params:
               if "lora_" in k:
                   to_return[k] = t
                   bias_name = k.split("lora_")[0] + "bias"
                   lora_bias_names.add(bias_name)
               elif "bias" in k:
                   maybe_lora_bias[k] = t
           for k, t in maybe_lora_bias.items():
               if k in lora_bias_names:
                   to_return[k] = t
       else:
           raise NotImplementedError
       to_return = {k: maybe_zero_3(v) for k, v in to_return.items()}
       return to_return


   def safe_save_model_for_hf_trainer(
       trainer: transformers.Trainer, output_dir: str, bias="none"
   ):
       """Collects the state dict and dumps it to disk."""
       # check if zero3 mode enabled
       if deepspeed.is_deepspeed_zero3_enabled():
           state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
       else:
           if trainer.args.use_lora:
               state_dict = get_peft_state_maybe_zero_3(
                   trainer.model.named_parameters(), bias
               )
           else:
               state_dict = trainer.model.state_dict()
       if trainer.args.should_save and trainer.args.local_rank == 0:
           trainer._save(output_dir, state_dict=state_dict)

The method ``safe_save_model_for_hf_trainer``, which uses ``get_peft_state_maybe_zero_3``, helps tackle the problems of saving models trained either with or without ZeRO3.

.. code:: python

   def preprocess(
       messages,
       tokenizer: transformers.PreTrainedTokenizer,
       max_len: int,
   ) -> Dict:
       """Preprocesses the data for supervised fine-tuning."""

       texts = []
       for i, msg in enumerate(messages):
           texts.append(
               tokenizer.apply_chat_template(
                   msg,
                   tokenize=True,
                   add_generation_prompt=False,
                   padding="max_length",
                   max_length=max_len,
                   truncation=True,
               )
           )
       input_ids = torch.tensor(texts, dtype=torch.int)
       target_ids = input_ids.clone()
       # Mask padding positions so they are ignored by the loss.
       target_ids[target_ids == tokenizer.pad_token_id] = IGNORE_TOKEN_ID
       attention_mask = input_ids.ne(tokenizer.pad_token_id)

       return dict(
           input_ids=input_ids, target_ids=target_ids, attention_mask=attention_mask
       )

For data preprocessing, we use ``preprocess`` to organize the data. Specifically, we apply our ChatML template to the texts. If you prefer other chat templates, you can use them instead, e.g., by still calling ``apply_chat_template()`` with another tokenizer. The chat template is stored in ``tokenizer_config.json`` in the HF repo. Additionally, we pad the sequence of each sample to the maximum length for training.
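If you want to see what the template produces before training, you can render one of the example samples above with ``tokenize=False``. This is only a quick sketch; the checkpoint name is an arbitrary example of a tokenizer whose ``tokenizer_config.json`` ships a ChatML ``chat_template``:

.. code:: python

   from transformers import AutoTokenizer

   # Any tokenizer that defines a chat template works here; this checkpoint is purely an example.
   tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")

   messages = [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": "What is your name?"},
       {"role": "assistant", "content": "My name is Qwen."},
   ]

   # tokenize=False returns the formatted ChatML string instead of token ids.
   print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))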
.. code:: python

   class SupervisedDataset(Dataset):
       """Dataset for supervised fine-tuning."""

       def __init__(
           self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int
       ):
           super(SupervisedDataset, self).__init__()

           rank0_print("Formatting inputs...")
           messages = [example["messages"] for example in raw_data]
           data_dict = preprocess(messages, tokenizer, max_len)

           self.input_ids = data_dict["input_ids"]
           self.target_ids = data_dict["target_ids"]
           self.attention_mask = data_dict["attention_mask"]

       def __len__(self):
           return len(self.input_ids)

       def __getitem__(self, i) -> Dict[str, torch.Tensor]:
           return dict(
               input_ids=self.input_ids[i],
               labels=self.target_ids[i],
               attention_mask=self.attention_mask[i],
           )


   class LazySupervisedDataset(Dataset):
       """Dataset for supervised fine-tuning."""

       def __init__(
           self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int
       ):
           super(LazySupervisedDataset, self).__init__()
           self.tokenizer = tokenizer
           self.max_len = max_len

           rank0_print("Formatting inputs...Skip in lazy mode")
           self.tokenizer = tokenizer
           self.raw_data = raw_data
           self.cached_data_dict = {}

       def __len__(self):
           return len(self.raw_data)

       def __getitem__(self, i) -> Dict[str, torch.Tensor]:
           if i in self.cached_data_dict:
               return self.cached_data_dict[i]

           ret = preprocess([self.raw_data[i]["messages"]], self.tokenizer, self.max_len)
           ret = dict(
               input_ids=ret["input_ids"][0],
               labels=ret["target_ids"][0],
               attention_mask=ret["attention_mask"][0],
           )
           self.cached_data_dict[i] = ret

           return ret


   def make_supervised_data_module(
       tokenizer: transformers.PreTrainedTokenizer,
       data_args,
       max_len,
   ) -> Dict:
       """Make dataset and collator for supervised fine-tuning."""
       dataset_cls = (
           LazySupervisedDataset if data_args.lazy_preprocess else SupervisedDataset
       )
       rank0_print("Loading data...")

       train_data = []
       with open(data_args.data_path, "r") as f:
           for line in f:
               train_data.append(json.loads(line))
       train_dataset = dataset_cls(train_data, tokenizer=tokenizer, max_len=max_len)

       if data_args.eval_data_path:
           eval_data = []
           with open(data_args.eval_data_path, "r") as f:
               for line in f:
                   eval_data.append(json.loads(line))
           eval_dataset = dataset_cls(eval_data, tokenizer=tokenizer, max_len=max_len)
       else:
           eval_dataset = None

       return dict(train_dataset=train_dataset, eval_dataset=eval_dataset)

Then we utilize ``make_supervised_data_module``, which uses ``SupervisedDataset`` or ``LazySupervisedDataset``, to build the dataset.
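As a quick sanity check outside the ``Trainer``, you can also call ``preprocess`` directly on your data and inspect the resulting tensors. This is only a sketch reusing the function above; the file name, tokenizer checkpoint, and ``max_len`` are examples:

.. code:: python

   import json

   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")  # example checkpoint

   raw_data = []
   with open("data.jsonl") as f:
       for line in f:
           raw_data.append(json.loads(line))
   messages = [example["messages"] for example in raw_data]

   data_dict = preprocess(messages, tokenizer, max_len=512)
   # input_ids and target_ids share one shape; padding positions in target_ids are IGNORE_TOKEN_ID.
   print(data_dict["input_ids"].shape, data_dict["target_ids"].shape)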
.. code:: python

   def train():
       global local_rank

       parser = transformers.HfArgumentParser(
           (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
       )
       (
           model_args,
           data_args,
           training_args,
           lora_args,
       ) = parser.parse_args_into_dataclasses()

       # This serves for single-gpu qlora.
       if (
           getattr(training_args, "deepspeed", None)
           and int(os.environ.get("WORLD_SIZE", 1)) == 1
       ):
           training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED

       local_rank = training_args.local_rank

       device_map = None
       world_size = int(os.environ.get("WORLD_SIZE", 1))
       ddp = world_size != 1
       if lora_args.q_lora:
           device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else "auto"
           if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
               logging.warning("FSDP or ZeRO3 is incompatible with QLoRA.")

       model_load_kwargs = {
           "low_cpu_mem_usage": not deepspeed.is_deepspeed_zero3_enabled(),
       }

       compute_dtype = (
           torch.float16
           if training_args.fp16
           else (torch.bfloat16 if training_args.bf16 else torch.float32)
       )

       # Load model and tokenizer
       config = transformers.AutoConfig.from_pretrained(
           model_args.model_name_or_path,
           cache_dir=training_args.cache_dir,
       )
       config.use_cache = False

       model = AutoModelForCausalLM.from_pretrained(
           model_args.model_name_or_path,
           config=config,
           cache_dir=training_args.cache_dir,
           device_map=device_map,
           quantization_config=BitsAndBytesConfig(
               load_in_4bit=True,
               bnb_4bit_use_double_quant=True,
               bnb_4bit_quant_type="nf4",
               bnb_4bit_compute_dtype=compute_dtype,
           )
           if training_args.use_lora and lora_args.q_lora
           else None,
           **model_load_kwargs,
       )
       tokenizer = AutoTokenizer.from_pretrained(
           model_args.model_name_or_path,
           cache_dir=training_args.cache_dir,
           model_max_length=training_args.model_max_length,
           padding_side="right",
           use_fast=False,
       )

       if training_args.use_lora:
           lora_config = LoraConfig(
               r=lora_args.lora_r,
               lora_alpha=lora_args.lora_alpha,
               target_modules=lora_args.lora_target_modules,
               lora_dropout=lora_args.lora_dropout,
               bias=lora_args.lora_bias,
               task_type="CAUSAL_LM",
           )
           if lora_args.q_lora:
               model = prepare_model_for_kbit_training(
                   model, use_gradient_checkpointing=training_args.gradient_checkpointing
               )

           model = get_peft_model(model, lora_config)

           # Print peft trainable params
           model.print_trainable_parameters()

           if training_args.gradient_checkpointing:
               model.enable_input_require_grads()

       # Load data
       data_module = make_supervised_data_module(
           tokenizer=tokenizer, data_args=data_args, max_len=training_args.model_max_length
       )

       # Start trainer
       trainer = Trainer(
           model=model, tokenizer=tokenizer, args=training_args, **data_module
       )

       # `not training_args.use_lora` is a temporary workaround for the issue that there are problems with
       # loading the checkpoint when using LoRA with DeepSpeed.
       # Check this issue https://github.com/huggingface/peft/issues/746 for more information.
       if (
           list(pathlib.Path(training_args.output_dir).glob("checkpoint-*"))
           and not training_args.use_lora
       ):
           trainer.train(resume_from_checkpoint=True)
       else:
           trainer.train()
       trainer.save_state()

       safe_save_model_for_hf_trainer(
           trainer=trainer, output_dir=training_args.output_dir, bias=lora_args.lora_bias
       )

The ``train`` method is the key to the training. In general, it loads the tokenizer and model with ``AutoTokenizer.from_pretrained()`` and ``AutoModelForCausalLM.from_pretrained()``. If we use LoRA, the method initializes the LoRA configuration with ``LoraConfig``. If we apply Q-LoRA, we should use ``prepare_model_for_kbit_training``. Note that, for now, resuming from a checkpoint is not supported for LoRA. Then we leave the remaining efforts to ``trainer`` and have a cup of coffee!
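After training with ``--use_lora True``, what lands in ``--output_dir`` is typically a set of adapter weights rather than a full model. A minimal sketch of loading them for inference with ``peft`` is shown below; the output path and generation settings are only examples:

.. code:: python

   from peft import AutoPeftModelForCausalLM
   from transformers import AutoTokenizer

   # "output_qwen" stands for the --output_dir used during training.
   model = AutoPeftModelForCausalLM.from_pretrained("output_qwen", device_map="auto")
   tokenizer = AutoTokenizer.from_pretrained("output_qwen")

   prompt = tokenizer.apply_chat_template(
       [{"role": "user", "content": "What is your name?"}],
       tokenize=False,
       add_generation_prompt=True,
   )
   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
   output_ids = model.generate(**inputs, max_new_tokens=32)
   print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If you prefer a standalone checkpoint, you can also call ``model.merge_and_unload()`` to fold the adapter back into the base weights before saving.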
Next Step
---------

Now, you are able to use a very simple script to perform different types of SFT. Alternatively, you can use more advanced training libraries, such as `Axolotl `__ or `LLaMA-Factory `__, to enjoy more functionalities. To take a step forward, after SFT, you can consider RLHF to align your model with human preferences! Stay tuned for our next tutorial on RLHF!