Readme

Getting Started, Part 2

Official documentation: https://huggingface.co/docs/transformers/index

GET STARTED

Pipeline

pipeline() is the simplest way to run inference with a pretrained model: create a pipeline() instance, specify the task you want to use it for, and you are ready to go.

Using a different model and tokenizer in the pipeline:

You can use the tags on the Hub to filter for a suitable model, then use AutoModelForSequenceClassification and AutoTokenizer to load the pretrained model and its associated tokenizer (see the next section).

Finally, specify the model and tokenizer in pipeline().
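A minimal sketch of that flow, reusing the sentiment checkpoint that also appears in the AutoClass section below (any other Hub checkpoint works the same way):

# Load a specific model and tokenizer, then hand them to pipeline()
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("We are very happy to show you the Transformers library.")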

Fine-tuning tutorial: still to be studied.

AutoClass

An AutoClass is a shortcut that automatically infers a pretrained model's architecture from its name or path. You only need to choose the right AutoClass for your task and its associated preprocessing class.

  • AutoTokenizer

Used to load a tokenizer. The tokenizer preprocesses text, converting it into an array of numbers that can be fed to the model.

Steps:

Load a tokenizer with AutoTokenizer.

Pass the text to the tokenizer.

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoding = tokenizer("We are very happy...")

The tokenizer returns a dictionary containing:

1. input_ids: numerical representations of the tokens
2. attention_mask: indicates which tokens the model should attend to

  • AutoModel

TUTORIALS

Pipelines for inference

Choose a model and tokenizer

The pipeline() loads a default model and a preprocessing class capable of inference for your task, e.g. the text-generation task (see the code below).

Just load them from AutoTokenizer and AutoModelForCausalLM, for example:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
>>> generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")
[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Dragon-lords (for them to rule in a world ruled by their rulers, and all who live within the realm'}]

Audio pipeline

The pipeline() also supports audio tasks like audio classification and automatic speech recognition.

For example, classify the emotion in a speech recording:

>>> from datasets import load_dataset
# load_dataset loads a dataset from the Hugging Face Hub or from local dataset files
>>> import torch
>>> torch.manual_seed(42)
>>> ds = load_dataset("...")
>>> audio_file = ds[0]["audio"]["path"]
>>> from transformers import pipeline
>>> audio_classifier = pipeline(task="audio-classification", model="...")
>>> preds = audio_classifier(audio_file)
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds

Vision pipeline

Specify your task and pass your image to the classifier. The image can be a link (URL) or a local path.

>>> from transformers import pipeline
>>> vision_classifier = pipeline(task="image-classification")
>>> preds = vision_classifier(images="...")
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds

Multimodal pipeline

The pipeline() supports more than one modality. For example, a visual question answering (VQA) task combines text and image.

>>> image = "..."
>>> question = "Where is the cat?"

Create a pipeline for VQA and pass it the image and question:

>>> from transformers import pipeline
>>> vqa = pipeline(task="vqa")
>>> preds = vqa(image=image, question=question)
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds

Load pretrained instances with an AutoClass

The from_pretrained() method lets you quickly load a pretrained model for any architecture so you don’t have to devote time and resources to train a model from scratch.

AutoTokenizer

Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format that can be processed by the model.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> sequence = "In a hole in the ground there lived a hobbit."
>>> print(tokenizer(sequence))

AutoFeatureExtractor

For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained(
        "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
    )

AutoProcessor

Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the LayoutLMV2 model requires a feature extractor to handle images and a tokenizer to handle text.

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

AutoModel

(PyTorch) The AutoModelFor classes let you load a pretrained model for a given task.

(TensorFlow) The TFAutoModelFor classes let you load a pretrained model for a given task.
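For example, a rough sketch of loading the same checkpoint into each framework's sequence-classification class (loading a base checkpoint into a task head initializes the head weights randomly, so this is only for illustration):

# PyTorch
from transformers import AutoModelForSequenceClassification
pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# TensorFlow (requires TensorFlow to be installed)
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")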

Preprocess

The data (text, images, audio, ...) needs to be converted into tensors.

Pad

Ensures the sentences in a batch all have the same length.

Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences (padding with zeros).

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence:

>>> batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)

Truncation

In contrast to padding, truncation handles sequences that are too long.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model:

>>> batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)

Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the return_tensors parameter to either pt for PyTorch, or tf for TensorFlow:

# PyTorch
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
# TensorFlow
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")

Audio

Use a feature extractor to prepare the dataset for the model.

For example, load the MInDS-14 dataset.

Access the first element of the audio column to take a look at the input:

>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> dataset[0]["audio"]
{'array': ...,
 'path': ...,
 'sampling_rate': ...}

This returns three items:

  • array is the speech signal loaded - and potentially resampled - as a 1D array.
  • path points to the location of the audio file.
  • sampling_rate refers to how many data points in the speech signal are measured per second.

It is important that your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. Otherwise, use cast_column to resample the audio to the expected sampling rate. For example:

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

Just as shorter sentences are padded with 0s when processing text, the feature extractor also pads the array with 0s, which are interpreted as silence.

Create a function to preprocess the dataset so the audio samples have the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

# Assumes a feature extractor has been loaded, e.g. with AutoFeatureExtractor.from_pretrained(...)
>>> def preprocess_function(examples):
        audio_arrays = [x["array"] for x in examples["audio"]]
        inputs = feature_extractor(
            audio_arrays,
            sampling_rate=16000,
            padding=True,
            max_length=100000,
            truncation=True,
        )
        return inputs

>>> processed_dataset = preprocess_function(dataset[:5])

Computer vision

For example, load the food101 dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("food101", split="train[:100]")

For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing.

In this example, we use the torchvision transforms module to:

1. Normalize the image.

2. Create a function that generates pixel_values from the transforms, which serve as the model's input.

3. Use set_transform to apply the transforms on the fly.

# Assumes an image feature extractor has been loaded, e.g. with AutoFeatureExtractor.from_pretrained(...)
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
>>> _transforms = Compose(
        [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
    )
>>> def transforms(examples):
        examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
        return examples
>>> dataset.set_transform(transforms)
>>> dataset[0]
{'image': ...,
 'label': ...,
 'pixel_values': ...}

Multimodal

Use a processor to prepare the dataset for the model.

For automatic speech recognition (ASR), load the LJ Speech dataset.

Remember, you should always resample your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain the model.

>>> lj_speech = load_dataset("lj_speech", split="train")
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])  # drop the columns we don't care about
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))

1. Load the processor.

2. Create a function that processes the audio data in array into input_values and tokenizes the text into labels.

3. Apply the prepare_dataset function to a sample.

>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

>>> def prepare_dataset(example):
        audio = example["audio"]
        example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
        return example

>>> prepare_dataset(lj_speech[0])

Now you can pass the processed dataset to the model.

Fine-tune a pretrained model

Fine-tuning: training a pretrained model on a dataset specific to your task.

Prepare a dataset

>>> from datasets import load_dataset
>>> dataset = load_dataset("yelp_review_full")

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)  # use the map method to process the dataset in one step
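The Trainer example below uses small_train_dataset and small_eval_dataset; as in the official tutorial, you can create smaller subsets of the tokenized dataset to make fine-tuning quicker:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))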

Train

Train with PyTorch Trainer

Start by loading your model and specify the number of expected labels.

>>> from transformers import AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

1. Training hyperparameters

Next, create a TrainingArguments class, which contains all the hyperparameters you can tune as well as flags for activating different training options.

>>> from transformers import TrainingArguments
>>> training_args = TrainingArguments(output_dir="test_trainer")

2. Evaluate

Trainer does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics.

The Evaluate library provides a simple accuracy function you can load with the evaluate.load function:

>>> import numpy as np
>>> import evaluate
>>> metric = evaluate.load("accuracy")

Call compute on metric to calculate the accuracy of your predictions.

Remember that Transformers models return logits, so before passing your predictions to compute you need to convert the logits to predictions:

>>> def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

If you’d like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter:

>>> from transformers import TrainingArguments, Trainer
>>> training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

3. Trainer

Create a Trainer object with your model, training arguments, training and test datasets, and evaluation function:

>>> trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

Then fine-tune your model by calling train():

>>> trainer.train()

Train a TensorFlow model with Keras

1. Loading data for Keras (works great for small datasets)

You need to convert the dataset to a format that Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.

First, load a dataset.

from datasets import load_dataset

dataset = load_dataset("glue", "cola")
dataset = dataset["train"]  # Just take the training split for now

Next, load a tokenizer and tokenize the data as NumPy arrays.

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
tokenized_data = dict(tokenized_data)  # convert the BatchEncoding to a plain dict for Keras
labels = np.array(dataset["label"])  # Label is already an array of 0 and 1

Finally, load, compile and fit the model.

from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5))

model.fit(tokenized_data, labels)

2. Loading data as a tf.data.Dataset (for large datasets, to avoid slowdowns)

  • prepare_tf_dataset(): This is the method we recommend in most cases. Because it is a method on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and discard the others to make a simpler, more performant dataset.
  • to_tf_dataset: This method is more low-level, and is useful when you want to exactly control how your dataset is created, by specifying exactly which columns and label_cols to include.

For example:

tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
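Once converted, the tf.data.Dataset already pairs inputs with labels, so it can be compiled and fit just like before (a sketch, reusing the model and Adam optimizer from the previous snippet):

model.compile(optimizer=Adam(3e-5))  # no loss argument needed: the model picks a task-appropriate default
model.fit(tf_dataset)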

Train in native PyTorch

1. DataLoader

2. Optimizer and learning rate scheduler

3. Training loop (the previous steps are all preparation; the actual training happens here)

4. Evaluate

I don't feel like writing this part out in full; if needed, just check the official documentation later orz (a condensed sketch of steps 1-3 is below)
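A condensed sketch of steps 1-3, based on the official tutorial and assuming the small_train_dataset built in the fine-tuning section above (the evaluation loop is omitted here):

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

# 1. DataLoader: the dataset needs torch tensors and a "labels" column
small_train_dataset = small_train_dataset.remove_columns(["text"]).rename_column("label", "labels")
small_train_dataset.set_format("torch")
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)

# 2. Optimizer and learning rate scheduler
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# 3. Training loop
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)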

Additional resources

Refer to the official documentation.

Distributed training with Accelerate

Setup

Install Accelerate:

pip install accelerate

Import and create an Accelerator object:

>>> from accelerate import Accelerator
>>> accelerator = Accelerator()

Prepare to accelerate

Pass all the relevant training objects to the prepare method.

This includes your training and evaluation DataLoaders, a model and an optimizer:

>>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

Backward

The final addition is to replace the typical loss.backward() in the training loop with Accelerate's backward method:

You only need to add four lines (and drop the manual device placement) in the code from the "Train in native PyTorch" section above:

+ from accelerate import Accelerator
  from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
      for batch in train_dataloader:
-         batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)

Train

Once you’ve added the relevant lines of code, launch your training in a script or a notebook like Colaboratory.

Train with a script

Run the following command to create and save a configuration file:

accelerate config

Then launch your training with:

accelerate launch train.py

Train with a notebook

Wrap all the code responsible for training in a function, and pass it to notebook_launcher:

>>> from accelerate import notebook_launcher
>>> notebook_launcher(training_function)

Share a model

In this tutorial, you'll learn two methods for sharing a trained or fine-tuned model on the Model Hub:

  • Programmatically push your files to the Hub.
  • Drag-and-drop your files to the Hub with the web interface.

(This section is mainly about how to share your model, which is similar to pushing code to GitHub.)

Just refer to:

https://huggingface.co/docs/transformers/model_sharing
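For the programmatic route, the core call is push_to_hub (a sketch; "my-awesome-model" is a placeholder repo name, and you need to log in to the Hub first):

# In a terminal: huggingface-cli login; in a notebook you can use notebook_login()
from huggingface_hub import notebook_login
notebook_login()

# Push a fine-tuned model and its tokenizer to the Hub under your account
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")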

Feelings

Honestly, Transformers feels about the same as PyTorch to me: before I had touched it, it seemed very deep and mysterious; watching others learn it, I thought they were amazing, and tasks like speech recognition and text continuation all seemed magical...

After actually going through it once (probably not very deeply, and it still needs to be consolidated through practice), I realize that at its core it is really just those few steps. Those who came before us have already built the basic framework; what matters now is how we apply it and innovate on top of it.