Transformers Source Code Walkthrough, Part 2: pipeline
pipelines/text_classification.py defines the `TextClassificationPipeline` class, which can work with any `ModelForSequenceClassification` model. For example, we can call a model through `pipeline` like this:
```python
>>> from transformers import pipeline

>>> classifier = pipeline(model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier("This movie is disgustingly good !")
[{'label': 'POSITIVE', 'score': 1.0}]

>>> classifier("Director tried too much.")
[{'label': 'NEGATIVE', 'score': 0.996}]
```
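To make the "any `ModelForSequenceClassification`" point concrete, here is a minimal sketch that assembles the same pipeline from explicit components instead of a single model id; it reuses the SST-2 checkpoint above and assumes a PyTorch installation:

```python
>>> from transformers import (
...     AutoModelForSequenceClassification,
...     AutoTokenizer,
...     TextClassificationPipeline,
... )

>>> name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
>>> model = AutoModelForSequenceClassification.from_pretrained(name)  # any *ForSequenceClassification model works here
>>> tokenizer = AutoTokenizer.from_pretrained(name)

>>> # Construct the pipeline class directly rather than via the pipeline() factory.
>>> classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
>>> classifier("This movie is disgustingly good !")
```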
By following this call chain, we can dig into the implementation of transformers step by step. The entry point is the `pipeline` function, defined at line 562 of `pipelines/__init__.py`:
```python
def pipeline(
    task: str = None,
    model: Optional[Union[str, "PreTrainedModel", "TFPreTrainedModel"]] = None,
    config: Optional[Union[str, PretrainedConfig]] = None,
    tokenizer: Optional[Union[str, PreTrainedTokenizer, "PreTrainedTokenizerFast"]] = None,
    feature_extractor: Optional[Union[str, PreTrainedFeatureExtractor]] = None,
    image_processor: Optional[Union[str, BaseImageProcessor]] = None,
    framework: Optional[str] = None,
    revision: Optional[str] = None,
    use_fast: bool = True,
    token: Optional[Union[str, bool]] = None,
    device: Optional[Union[int, str, "torch.device"]] = None,
    device_map=None,
    torch_dtype=None,
    trust_remote_code: Optional[bool] = None,
    model_kwargs: Dict[str, Any] = None,
    pipeline_class: Optional[Any] = None,
    **kwargs,
) -> Pipeline:
```
This function builds a [`Pipeline`]. Its documentation consists of three parts: the function arguments, the return value, and usage examples.

Function arguments:
Args:

- `task` (`str`): The task defining which pipeline will be returned. Currently accepted tasks are:
  - `"audio-classification"`: will return an [`AudioClassificationPipeline`].
  - `"automatic-speech-recognition"`: will return an [`AutomaticSpeechRecognitionPipeline`].
  - `"conversational"`: will return a [`ConversationalPipeline`].
  - `"depth-estimation"`: will return a [`DepthEstimationPipeline`].
  - `"document-question-answering"`: will return a [`DocumentQuestionAnsweringPipeline`].
  - `"feature-extraction"`: will return a [`FeatureExtractionPipeline`].
  - `"fill-mask"`: will return a [`FillMaskPipeline`].
  - `"image-classification"`: will return an [`ImageClassificationPipeline`].
  - `"image-feature-extraction"`: will return an [`ImageFeatureExtractionPipeline`].
  - `"image-segmentation"`: will return an [`ImageSegmentationPipeline`].
  - `"image-to-image"`: will return an [`ImageToImagePipeline`].
  - `"image-to-text"`: will return an [`ImageToTextPipeline`].
  - `"mask-generation"`: will return a [`MaskGenerationPipeline`].
  - `"object-detection"`: will return an [`ObjectDetectionPipeline`].
  - `"question-answering"`: will return a [`QuestionAnsweringPipeline`].
  - `"summarization"`: will return a [`SummarizationPipeline`].
  - `"table-question-answering"`: will return a [`TableQuestionAnsweringPipeline`].
  - `"text2text-generation"`: will return a [`Text2TextGenerationPipeline`].
  - `"text-classification"` (alias `"sentiment-analysis"` available): will return a [`TextClassificationPipeline`].
  - `"text-generation"`: will return a [`TextGenerationPipeline`].
  - `"text-to-audio"` (alias `"text-to-speech"` available): will return a [`TextToAudioPipeline`].
  - `"token-classification"` (alias `"ner"` available): will return a [`TokenClassificationPipeline`].
  - `"translation"`: will return a [`TranslationPipeline`].
  - `"translation_xx_to_yy"`: will return a [`TranslationPipeline`].
  - `"video-classification"`: will return a [`VideoClassificationPipeline`].
  - `"visual-question-answering"`: will return a [`VisualQuestionAnsweringPipeline`].
  - `"zero-shot-classification"`: will return a [`ZeroShotClassificationPipeline`].
  - `"zero-shot-image-classification"`: will return a [`ZeroShotImageClassificationPipeline`].
  - `"zero-shot-audio-classification"`: will return a [`ZeroShotAudioClassificationPipeline`].
  - `"zero-shot-object-detection"`: will return a [`ZeroShotObjectDetectionPipeline`].
- `model` (`str` or [`PreTrainedModel`] or [`TFPreTrainedModel`], *optional*): The model that will be used by the pipeline to make predictions. This can be a model identifier or an actual instance of a pretrained model inheriting from [`PreTrainedModel`] (for PyTorch) or [`TFPreTrainedModel`] (for TensorFlow). If not provided, the default for the `task` will be loaded.
- `config` (`str` or [`PretrainedConfig`], *optional*): The configuration that will be used by the pipeline to instantiate the model. This can be a model identifier or an actual pretrained model configuration inheriting from [`PretrainedConfig`]. If not provided, the default configuration file for the requested model will be used. That means that if `model` is given, its default configuration will be used. However, if `model` is not supplied, this `task`'s default model's config is used instead.
- `tokenizer` (`str` or [`PreTrainedTokenizer`], *optional*): The tokenizer that will be used by the pipeline to encode data for the model. This can be a model identifier or an actual pretrained tokenizer inheriting from [`PreTrainedTokenizer`]. If not provided, the default tokenizer for the given `model` will be loaded (if it is a string). If `model` is not specified or not a string, then the default tokenizer for `config` is loaded (if it is a string). However, if `config` is also not given or not a string, then the default tokenizer for the given `task` will be loaded.
- `feature_extractor` (`str` or [`PreTrainedFeatureExtractor`], *optional*): The feature extractor that will be used by the pipeline to encode data for the model. This can be a model identifier or an actual pretrained feature extractor inheriting from [`PreTrainedFeatureExtractor`]. Feature extractors are used for non-NLP models, such as Speech or Vision models as well as multi-modal models. Multi-modal models will also require a tokenizer to be passed. If not provided, the default feature extractor for the given `model` will be loaded (if it is a string). If `model` is not specified or not a string, then the default feature extractor for `config` is loaded (if it is a string). However, if `config` is also not given or not a string, then the default feature extractor for the given `task` will be loaded.
- `framework` (`str`, *optional*): The framework to use, either `"pt"` for PyTorch or `"tf"` for TensorFlow. The specified framework must be installed. If no framework is specified, will default to the one currently installed. If no framework is specified and both frameworks are installed, will default to the framework of the `model`, or to PyTorch if no model is provided.
- `revision` (`str`, *optional*, defaults to `"main"`): When passing a task name or a string model identifier: the specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git.
- `use_fast` (`bool`, *optional*, defaults to `True`): Whether or not to use a Fast tokenizer if possible (a [`PreTrainedTokenizerFast`]).
- `token` (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`).
- `device` (`int` or `str` or `torch.device`): Defines the device (*e.g.*, `"cpu"`, `"cuda:1"`, `"mps"`, or a GPU ordinal rank like `1`) on which this pipeline will be allocated.
- `device_map` (`str` or `Dict[str, Union[int, str, torch.device]]`, *optional*): Sent directly as `model_kwargs` (just a simpler shortcut). When the `accelerate` library is present, set `device_map="auto"` to compute the most optimized `device_map` automatically (see [here](https://huggingface.co/docs/accelerate/main/en/package_reference/big_modeling#accelerate.cpu_offload) for more information). Do not use `device_map` and `device` at the same time, as they will conflict (a short sketch after this list shows these placement options in use).
- `torch_dtype` (`str` or `torch.dtype`, *optional*): Sent directly as `model_kwargs` (just a simpler shortcut) to use the available precision for this model (`torch.float16`, `torch.bfloat16`, ... or `"auto"`).
- `trust_remote_code` (`bool`, *optional*, defaults to `False`): Whether or not to allow for custom code defined on the Hub in their own modeling, configuration, tokenization or even pipeline files. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.
- `model_kwargs` (`Dict[str, Any]`, *optional*): Additional dictionary of keyword arguments passed along to the model's `from_pretrained(..., **model_kwargs)` function.
- `kwargs` (`Dict[str, Any]`, *optional*): Additional keyword arguments passed along to the specific pipeline init (see the documentation for the corresponding pipeline class for possible values).
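To ground the loading-related arguments above, here is a hedged sketch combining `revision`, `torch_dtype`, and `device_map`; it assumes a CUDA-capable machine with the `accelerate` package installed, and reuses the SST-2 checkpoint from the start of this post. The particular values are illustrative, not the only valid choices:

```python
>>> import torch
>>> from transformers import pipeline

>>> classifier = pipeline(
...     "text-classification",
...     model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
...     revision="main",            # any git branch, tag, or commit id on the Hub
...     torch_dtype=torch.float16,  # shortcut for model_kwargs={"torch_dtype": torch.float16}
...     device_map="auto",          # requires accelerate; do not also pass `device`
... )
```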
Function return:

Returns: [`Pipeline`], a pipeline suited to the task.
Usage examples:
```python
>>> from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

>>> # Sentiment analysis pipeline
>>> analyzer = pipeline("sentiment-analysis")

>>> # Question answering pipeline, specifying the checkpoint identifier
>>> oracle = pipeline(
...     "question-answering", model="distilbert/distilbert-base-cased-distilled-squad", tokenizer="google-bert/bert-base-cased"
... )

>>> # Named entity recognition pipeline, passing in a specific model and tokenizer
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
>>> recognizer = pipeline("ner", model=model, tokenizer=tokenizer)
```
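However it was constructed, the returned [`Pipeline`] object is itself a callable. As a small illustration (a sketch reusing the sentiment classifier from the top of this post), it also accepts a list of inputs and returns one result dict per input:

```python
>>> from transformers import pipeline

>>> classifier = pipeline(model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
>>> # A list in, a list of {'label': ..., 'score': ...} dicts out.
>>> results = classifier(["This movie is disgustingly good !", "Director tried too much."])
>>> for r in results:
...     print(r["label"], round(r["score"], 3))
```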