Transformers🤗
Readme
Getting Started, Part 2
Official documentation: https://huggingface.co/docs/transformers/index
GET STARTED
Pipeline
pipeline() is the simplest way to run inference with a pretrained model: create a pipeline() instance, specify the task you want it for, and you are ready to go.
Using a different model and tokenizer in the pipeline:
You can filter for a suitable model with the tags on the Hub, load the pretrained model and its associated tokenizer with AutoModelForSequenceClassification and AutoTokenizer (see the next section), and finally specify the model and tokenizer in pipeline().
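A minimal sketch of this flow, along the lines of the official quick tour (the checkpoint name is one the docs use; any sequence-classification checkpoint works):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hand both to pipeline() instead of relying on the task's default checkpoint
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("We are very happy to show you the 🤗 Transformers library."))
```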
Fine-tuning tutorial: still to be studied.
AutoClass
An AutoClass is a shortcut that automatically infers a model's architecture from the name or path of a pretrained model. You only need to pick the AutoClass appropriate for your task and its associated preprocessing class.
- AutoTokenizer
Loads a tokenizer. The tokenizer preprocesses text, converting it into an array of numbers that the model takes as input.
Steps:
1. Load a tokenizer with AutoTokenizer.
2. Pass your text to the tokenizer.
```python
from transformers import AutoTokenizer
```
The tokenizer returns a dictionary containing:
1. input_ids: numeric representations of your tokens
2. attention_mask: indicates which tokens should be attended to
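A quick sketch of what that dictionary looks like (bert-base-cased is just an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("In a hole in the ground there lived a hobbit.")
print(encoding)
# {'input_ids': [101, 1130, ...], 'token_type_ids': [0, 0, ...], 'attention_mask': [1, 1, ...]}
```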
- AutoModel: loads a pretrained model for a given task (see the AutoModel section below)
TUTORIALS
Pipelines for inference
Choose a model and tokenizer
The pipeline() loads a default model and a preprocessing class capable of inference for your task, e.g. the text-generation task (see the code below).
Just load them from AutoTokenizer and AutoModelForCausalLM, e.g.:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
```
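In full, roughly (distilgpt2 is the checkpoint the tutorial uses; the prompt is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
print(generator("Three Rings for the Elven-kings under the sky"))
```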
Audio pipeline
The pipeline() also supports audio tasks like audio classification and automatic speech recognition.
For example, classifying the emotion in a speech recording:
```python
from datasets import load_dataset
```
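A fuller sketch, assuming the MInDS-14 dataset and a SUPERB emotion-recognition checkpoint similar to what the tutorial uses:

```python
from datasets import load_dataset, Audio
from transformers import pipeline

speech_classifier = pipeline(task="audio-classification", model="superb/wav2vec2-base-superb-er")

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # match the model's sampling rate

# Classify the emotion of the first sample's raw waveform
print(speech_classifier(dataset[0]["audio"]["array"]))
```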
Vision pipeline
Specify your task and pass your image to the classifier. The image can be a link or a local path to the image.
```python
from transformers import pipeline
```
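A sketch (the task picks a default checkpoint; the image URL is one the docs use):

```python
from transformers import pipeline

vision_classifier = pipeline(task="image-classification")
preds = vision_classifier(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
print(preds)
```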
Multimodal pipeline
The pipeline() supports more than one modality. For example, a visual question answering (VQA) task combines text and image.
```python
image = "..."
```
Create a pipeline for VQA and pass it the image and question:
```python
from transformers import pipeline
```
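In full, roughly (ViLT is the VQA checkpoint the docs use; the image URL is illustrative):

```python
from transformers import pipeline

vqa = pipeline(task="vqa", model="dandelin/vilt-b32-finetuned-vqa")
preds = vqa(
    image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    question="What animal is in the picture?",
)
print(preds)
```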
Load pretrained instances with an AutoClass
The from_pretrained() method lets you quickly load a pretrained model for any architecture, so you don't have to devote time and resources to training a model from scratch.
AutoTokenizer
Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format that can be processed by the model.
```python
from transformers import AutoTokenizer
```
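Then tokenize as in the earlier section, e.g. (checkpoint is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Do not meddle in the affairs of wizards.")
print(encoding["input_ids"])
print(tokenizer.decode(encoding["input_ids"]))  # round-trips back to text (plus special tokens)
```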
AutoFeatureExtractor
For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.
```python
from transformers import AutoFeatureExtractor
```
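A sketch with a Wav2Vec2 checkpoint (illustrative; any audio checkpoint with a feature extractor works):

```python
import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# One second of silence at 16 kHz, just to show the input format
inputs = feature_extractor(np.zeros(16_000), sampling_rate=16_000, return_tensors="pt")
print(inputs["input_values"].shape)  # torch.Size([1, 16000])
```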
AutoProcessor
Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the LayoutLMV2 model requires a feature extractor to handle images and a tokenizer to handle text.
```python
from transformers import AutoProcessor
```
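For example (checkpoint per the docs):

```python
from transformers import AutoProcessor

# Bundles a feature extractor (for document images) and a tokenizer (for text)
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")
```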
AutoModel
(PyTorch) The AutoModelFor classes let you load a pretrained model for a given task.
(TensorFlow) The TFAutoModelFor classes do the same in TensorFlow.
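A sketch of both (the checkpoint is illustrative; the TF variant requires TensorFlow to be installed):

```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

# PyTorch
pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# TensorFlow
tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
```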
Preprocess
Data (text, images, audio, ...) needs to be converted into tensors.
Pad
Makes the sentences in a batch the same length.
Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences (the padded positions are filled with zeros).
Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence:
```python
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
```
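Calling the tokenizer with padding=True then gives equal-length rows; a sketch (checkpoint is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer(batch_sentences, padding=True)

# Shorter sentences are padded out with the pad token id (0 for BERT)
print([len(ids) for ids in encoded_input["input_ids"]])  # all the same length
```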
Truncation
The opposite of padding: truncation handles sentences that are too long.
Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model:
```python
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
```
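For example, reusing the tokenizer from above (a sketch):

```python
# Sequences longer than the model's maximum length are cut down to it
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
```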
Build tensors
Finally, you want the tokenizer to return the actual tensors that get fed to the model.
Set the return_tensors parameter to either pt for PyTorch or tf for TensorFlow:
```python
# PyTorch
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
```
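And the TensorFlow variant, plus a quick shape check (a sketch, continuing the batch above):

```python
# TensorFlow
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")

# Either way, the values are now framework tensors rather than Python lists
print(encoded_input["input_ids"].shape)  # (3, max_sequence_length)
```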
Audio
Use a feature extractor to prepare the dataset for the model.
For example, load the MInDS-14 dataset and access the first element of the audio column to take a look at the input:
```python
from datasets import load_dataset, Audio
```
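Per the docs, a sketch:

```python
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
print(dataset[0]["audio"])
```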
This returns three items:
- array: the speech signal loaded (and potentially resampled) as a 1D array
- path: the location of the audio file
- sampling_rate: how many data points of the speech signal are measured per second
It is important that your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. Otherwise, use cast_column to upsample it. For example:
```python
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```
Just as text preprocessing pads shorter sentences with 0s, the feature extractor also pads the array with 0s (which the model interprets as silence).
Create a function to preprocess the dataset so the audio samples are the same lengths. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:
```python
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        max_length=100000,
        truncation=True,
        padding=True,
    )
    return inputs
```
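Then apply it over the whole dataset with map; a sketch (the feature extractor is loaded as in the AutoFeatureExtractor section, checkpoint illustrative):

```python
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
processed_dataset = dataset.map(preprocess_function, batched=True)
```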
Computer vision
For example, load the food101 dataset:
```python
from datasets import load_dataset
```
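e.g. a small slice, as in the docs:

```python
from datasets import load_dataset

# Take a small slice so the demo stays fast
dataset = load_dataset("food101", split="train[:100]")
```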
For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing.
In this example, we use torchvision's transforms module:
1. Normalize the image.
2. Create a function that generates pixel_values from the transforms and use pixel_values as the model's input.
3. Use set_transform to apply the transforms on the fly (see the sketch after the import below).
```python
from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
```
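A sketch of the three steps (the ViT checkpoint and the feature-extractor API are assumptions from the docs of that era; newer docs use image processors instead):

```python
from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

# 1. Normalize with the statistics the checkpoint was pretrained with
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
_transforms = Compose(
    [RandomResizedCrop(224), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
)

# 2. Generate pixel_values from the transforms as the model's input
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    return examples

# 3. Apply the transforms on the fly
dataset.set_transform(transforms)
```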
Multimodal
Use a processor to prepare the dataset for the model.
For automatic speech recognition (ASR), load the LJ Speech dataset:
Remember: you should always resample your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain the model.
```python
lj_speech = load_dataset("lj_speech", split="train")
```
1. Load the processor.
2. Create a function to process the audio data contained in array into input_values, and tokenize the text into labels.
3. Apply the prepare_dataset function to a sample (see the sketch after the import below).
```python
from transformers import AutoProcessor
```
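A sketch of steps 1-3, following the docs (Wav2Vec2 ASR checkpoint):

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor

lj_speech = load_dataset("lj_speech", split="train")
lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))  # match the checkpoint

# 1. Load the processor (feature extractor + tokenizer in one object)
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

# 2. Turn the waveform into input_values and the transcript into labels
def prepare_dataset(example):
    audio = example["audio"]
    example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
    return example

# 3. Apply it to a sample
prepare_dataset(lj_speech[0])
```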
Now you can pass the processed dataset to the model.
Fine-tune a pretrained model
Fine-tuning: training a pretrained model on a dataset specific to the task.
Prepare a dataset
```python
from datasets import load_dataset
```
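The tutorial uses the Yelp Reviews dataset, tokenizes it, and takes small subsets for speed; a sketch:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Smaller subsets make the demo run faster
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
```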
Train
Train with PyTorch Trainer
Start by loading your model and specifying the number of expected labels:
```python
from transformers import AutoModelForSequenceClassification
```
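Yelp reviews have 5 star ratings, hence 5 labels:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
```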
1. Training hyperparameters
Next, create a TrainingArguments class, which contains all the hyperparameters you can tune as well as flags for activating different training options.
```python
from transformers import TrainingArguments
```
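At minimum you specify where checkpoints are saved; everything else has defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")
```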
2. Evaluate
Trainer does not automatically evaluate model performance during training. You'll need to pass Trainer a function to compute and report metrics.
The Evaluate library provides a simple accuracy function you can load with evaluate.load:
```python
import numpy as np
```
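Loading the metric, per the docs:

```python
import evaluate

metric = evaluate.load("accuracy")
```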
Call compute on metric to calculate the accuracy of your predictions. Before passing your predictions to compute, you need to convert the logits to predictions (all 🤗 Transformers models return logits):
```python
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
If you'd like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter:
```python
from transformers import TrainingArguments, Trainer
```
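e.g. evaluating once per epoch:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
```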
3. Trainer
Create a Trainer object with your model, training arguments, training and test datasets, and evaluation function:
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
```
Then fine-tune your model by calling train():
```python
trainer.train()
```
Train a TensorFlow model with Keras
1. Loading data for Keras (works well for small datasets)
You need to convert the dataset to a format that Keras understands. If your dataset is small, you can convert the whole thing to NumPy arrays and pass it to Keras.
First, load a dataset.
```python
from datasets import load_dataset
```
Next, load a tokenizer and tokenize the data as NumPy arrays.
```python
from transformers import AutoTokenizer
```
Finally, load, compile and fit the model.
```python
from transformers import TFAutoModelForSequenceClassification
```
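Putting the three steps together, a sketch close to the docs (GLUE CoLA is the dataset the tutorial uses):

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

# 1. Load a dataset
dataset = load_dataset("glue", "cola")["train"]

# 2. Tokenize as NumPy arrays
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = dict(tokenizer(dataset["sentence"], return_tensors="np", padding=True))
labels = np.array(dataset["label"])

# 3. Load, compile, and fit; HF models pick a sensible default loss if none is given
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model.compile(optimizer=Adam(3e-5))  # low learning rates work better for fine-tuning
model.fit(tokenized_data, labels)
```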
2. Loading data as a tf.data.Dataset (for large datasets, to avoid slowing down)
- prepare_tf_dataset(): the method we recommend in most cases. Because it is a method on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and discard the others to make a simpler, more performant dataset.
- to_tf_dataset: more low-level, and useful when you want to control exactly how your dataset is created, by specifying exactly which columns and label_cols to include.
For example:
```python
tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
```
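After that, training looks like plain Keras (a sketch):

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(3e-5))  # no loss argument: the model supplies a default
model.fit(tf_dataset)
```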
Train in native PyTorch
1. DataLoader
2. Optimizer and learning rate scheduler
3. Training loop (everything up to here is preparation; actual training starts here)
4. Evaluate
Not going to write these out in full; if I need them later I'll just read the official docs orz
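Still, a condensed sketch of the four steps, roughly following the docs (names are illustrative; it reuses model and small_train_dataset from the Trainer section, with torch-format columns and a labels column):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_scheduler

# 1. DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)

# 2. Optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)
num_training_steps = 3 * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# 3. Training loop
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        outputs.loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```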
Additional resources
refer to:
- 🤗 Transformers Examples includes scripts to train common NLP tasks in PyTorch and TensorFlow.
- 🤗 Transformers Notebooks contains various notebooks on how to fine-tune a model for specific tasks in PyTorch and TensorFlow.
Distributed training with Accelerate
Setup
install Accelerate:
```bash
pip install accelerate
```
import and create an Accelerator object:
```python
from accelerate import Accelerator

accelerator = Accelerator()
```
Prepare to accelerate
Pass all the relevant training objects to the prepare method. This includes your training and evaluation DataLoaders, a model, and an optimizer:
```python
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
```
Backward
Finally, replace the typical loss.backward() in your training loop with Accelerate's backward method.
You only need to add four lines to the code from the earlier "Train in native PyTorch" section:
```diff
+ from accelerate import Accelerator

+ accelerator = Accelerator()

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)
+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  for batch in train_dataloader:
-     batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss = outputs.loss
-     loss.backward()
+     accelerator.backward(loss)
```
Train
Once you’ve added the relevant lines of code, launch your training in a script or a notebook like Colaboratory.
Train with a script
run the following command to create and save a configuration file:
```bash
accelerate config
```
Then launch your training with:
```bash
accelerate launch train.py
```
Train with a notebook
Wrap all the code responsible for training in a function, and pass it to notebook_launcher:
```python
from accelerate import notebook_launcher
```
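e.g. (training_function here stands for whatever function you wrapped your training loop in):

```python
from accelerate import notebook_launcher

notebook_launcher(training_function)
```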
Share a model
In this tutorial, you'll learn two methods for sharing a trained or fine-tuned model on the Model Hub:
- Programmatically push your files to the Hub.
- Drag-and-drop your files to the Hub with the web interface.
(This section is mainly about how to share your model; it's similar to pushing code to GitHub.)
Just refer to the docs directly:
https://huggingface.co/docs/transformers/model_sharing
Feelings
Honestly, Transformers feels much like PyTorch did: before I ever touched it, it seemed profound and out of reach; watching others learn it, I thought they were amazing, and tasks like speech recognition or text continuation felt magical...
Having actually worked through it once (perhaps not very deeply, and it still needs consolidating in practice), I realize the essence is the same few steps every time. Our predecessors have already built the basic framework for us; what remains is how we apply it and innovate on top of it.