Monday, October 6, 2025

Learn How to Use Transformers with HuggingFace and SpaCy


Introduction

The transformer is the state-of-the-art architecture for NLP, and not only for NLP. Modern models like ChatGPT, Llama, and Gemma are based on this architecture, introduced in 2017 in the Attention Is All You Need paper by Vaswani et al.

In the previous article, we saw how to use spaCy to accomplish several tasks, and you might have noticed that we never had to train anything; we leveraged spaCy's built-in capabilities, which are mainly rule-based approaches.

spaCy also lets you insert trainable components into the NLP pipeline, or use models off the shelf from the 🤗 Hugging Face Hub, an online platform that provides open-source models for AI developers.

So let's learn how to use spaCy with Hugging Face models!

Why Transformers?

Before transformers, the state-of-the-art approach for creating vector representations of words was word vector methods. A word vector is a dense representation of a word, on which we can perform mathematical operations.

For example, we can observe that two words with a similar meaning also have similar vectors. The most well-known methods of this kind are GloVe and FastText.

These methods, though, have a big drawback: a word is always represented by the same vector. But a word does not always have the same meaning.

For instance:

  • “She went to the bank to withdraw some money.”
  • “He sat by the bank of the river, watching the water flow.”

In these two sentences, the word bank takes on two different meanings, so it does not make sense to always represent the word with the same vector.
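
We can check this with spaCy itself. Below is a small sketch, assuming the en_core_web_md model (which ships static word vectors) is installed; the vector of "bank" comes out identical in both sentences, even though the meaning differs.

import spacy
import numpy as np

# Requires a model with static vectors, e.g.:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("She went to the bank to withdraw some money.")
doc2 = nlp("He sat by the bank of the river, watching the water flow.")

bank1 = next(t for t in doc1 if t.text == "bank")
bank2 = next(t for t in doc2 if t.text == "bank")

# Static word vectors: the same word always gets exactly the same vector.
print(np.allclose(bank1.vector, bank2.vector))  # True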

With transformer-based architectures, we are now able to build models that consider the entire context to generate the vector representation of a word.
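
For contrast, here is a minimal sketch (assuming the transformers and torch libraries are installed) that extracts the contextual embedding of "bank" from roberta-base for both sentences; this time the two vectors are no longer identical.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def bank_vector(sentence):
    # Encode the sentence and take the hidden state of the "bank" token.
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = next(i for i, t in enumerate(tokens) if "bank" in t.lower())
    return hidden[idx]

v1 = bank_vector("She went to the bank to withdraw some money.")
v2 = bank_vector("He sat by the bank of the river, watching the water flow.")

# The cosine similarity is below 1.0: the same word gets a different vector
# depending on its context.
print(torch.cosine_similarity(v1, v2, dim=0).item())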

src: https://arxiv.org/abs/1706.03762

The main innovation introduced by this network is the multi-head attention block. If you are not familiar with it, I recently wrote an article about it: https://towardsdatascience.com/a-simple-implementation-of-the-attention-mechanism-from-scratch/

The transformer is made up of two parts. The left half is the encoder, which creates the vector representation of texts, and the right half is the decoder, which is used to generate new text. For example, GPT is based on the right half, because it generates text as a chatbot.

In this article, we are interested in the encoder part, which is able to capture the semantics of the text we give as input.

BERT and RoBERTa

This won't be a course about these models, but let's recap some main points.

While ChatGPT is built on the decoder side of the transformer architecture, BERT and RoBERTa are based on the encoder side.

BERT was introduced by Google in 2018, and you can read more about it here: https://arxiv.org/abs/1810.04805

BERT is a stack of encoder layers. There are two sizes of this model: BERT base contains 12 encoder layers, while BERT large contains 24.

src: https://iq.opengenus.org/content/images/2021/01/bert-base-bert-large-encoders.png

BERT base generates a vector of size 768, while BERT large generates a vector of size 1024. Both take an input of up to 512 tokens.

The tokenizer used by the BERT model is called WordPiece.
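
If you want to double-check these numbers and see WordPiece in action, here is a small sketch using the Hugging Face transformers library (an extra assumption on my side; it is not required for the rest of the article):

from transformers import AutoConfig, AutoTokenizer

# A quick check of the numbers above for BERT base.
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)        # 12 encoder layers
print(config.hidden_size)              # 768-dimensional vectors
print(config.max_position_embeddings)  # 512-token input limit

# WordPiece splits rare words into subword pieces marked with "##".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Transformers are unbelievably powerful."))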

BERT is trained on two objectives:

  • Masked Language Modeling (MLM): predicts missing (masked) tokens within a sentence (see the short fill-mask sketch after this list).
  • Next Sentence Prediction (NSP): determines whether a given second sentence logically follows the first one.
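
To get an intuition for the MLM objective, here is a minimal sketch with the Hugging Face fill-mask pipeline (again assuming the transformers library is installed):

from transformers import pipeline

# bert-base-uncased predicts the most likely fillers for the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The capital of Italy is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))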

The RoBERTa model builds on top of BERT with some key differences: https://arxiv.org/abs/1907.11692.

RoBERTa uses dynamic masking, so the masked tokens change at every iteration during training, and it does not use NSP as a training objective.

Use RoBERTa with SpaCy

The TextCategorizer is a spaCy component that predicts one or more labels for a whole document. It can work in two modes (a minimal code sketch follows the list below):

  • exclusive_classes = true: one label per text (e.g., positive or negative)
  • exclusive_classes = false: multiple labels per text (e.g., spam, urgent, billing)
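
For reference, this is what adding a TextCategorizer to a pipeline looks like in code. It is only an illustrative sketch; in the rest of this article we configure the same component through a config file instead.

import spacy

# "textcat" is the factory for mutually exclusive classes;
# "textcat_multilabel" would allow multiple labels per text.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

print(nlp.pipe_names)  # ['textcat']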

spaCy can combine this with different embeddings:

  • Classic word vectors (tok2vec)
  • Transformer models like RoBERTa, which we use here

This way we can leverage RoBERTa's understanding of the English language and integrate it into the spaCy pipeline to make it production ready.

If you have a dataset, you can further train the RoBERTa model using spaCy to fine-tune it on the specific downstream task you are trying to solve.

Dataset preparation

In this article I'm going to use the TREC dataset, which contains short questions. Each question is labelled with the type of answer it expects, such as:

Label   Meaning
ABBR    Abbreviation
DESC    Description / Definition
ENTY    Entity (thing, object)
HUM     Human (person, group)
LOC     Location (place)
NUM     Numeric (count, date, etc.)

Here is an example, where we expect a human name as the answer:

Q (text): “Who wrote the Iliad?”
A (label): “HUM”

As usual, we start by installing the libraries.

!pip install datasets==3.6.0
!pip install -U spacy[transformers]

Now we need to load and prepare the dataset.
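
Before converting anything, we can take a quick look at what the dataset contains. A small sketch; the field names text and coarse_label are the ones used in the conversion script further below.

from datasets import load_dataset

# Peek at the TREC training split and its coarse label names.
trec = load_dataset("trec", split="train")
print(trec[0])  # e.g. {'text': '...', 'coarse_label': ..., 'fine_label': ...}
print(trec.features["coarse_label"].names)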

With spacy.blank("en") we can create a blank spaCy pipeline for English. It doesn't include any components (like the tagger or the parser). It's lightweight and perfect for converting raw text to Doc objects without loading a full language model like we do with en_core_web_sm.

DocBin is a special spaCy class that efficiently stores many Doc objects in binary format. This is how spaCy expects training data to be stored.

Once converted and saved as .spacy files, these can be passed directly into spacy train, which is much faster than using plain JSON or text files.

So now this script to prepare the train and dev datasets should be fairly straightforward.

from datasets import load_dataset
import spacy
from spacy.tokens import DocBin

# Load TREC dataset
dataset = load_dataset("trec")

# Get label names (e.g., ["DESC", "ENTY", "ABBR", ...])
label_names = dataset["train"].features["coarse_label"].names

# Create a blank English pipeline (no components yet)
nlp = spacy.blank("en")

# Convert Hugging Face examples into spaCy Docs and save them as a .spacy file
def convert_to_spacy(split, filename):
    doc_bin = DocBin()
    for example in split:
        text = example["text"]
        label = label_names[example["coarse_label"]]
        cats = {name: 0.0 for name in label_names}
        cats[label] = 1.0
        doc = nlp.make_doc(text)
        doc.cats = cats
        doc_bin.add(doc)
    doc_bin.to_disk(filename)

convert_to_spacy(dataset["train"], "train.spacy")
convert_to_spacy(dataset["test"], "dev.spacy")
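
Optionally, we can sanity-check one of the saved files by loading it back. This is just a small sketch, assuming train.spacy was written by the script above.

import spacy
from spacy.tokens import DocBin

# Load the binary training data back and inspect the first Doc.
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))
print(docs[0].text, docs[0].cats)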

We are going to further train RoBERTa on this dataset using a spaCy CLI command. The command expects a config.cfg file where we describe the type of training, the model we are using, the number of epochs, and so on.

Here is the config file I used for my training.

[paths]
train = ./train.spacy
dev = ./dev.spacy
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 42

[nlp]
lang = "en"
pipeline = ["transformer", "textcat"]
batch_size = 32

[components]

[components.transformer]
manufacturing unit = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
identify = "roberta-base"
tokenizer_config = {"use_fast": true}
transformer_config = {}
mixed_precision = false
grad_scaler_config = {}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.textcat]
manufacturing unit = "textcat"
scorer = {"@scorers": "spacy.textcat_scorer.v2"}
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v3"
ngram_size = 1
no_output_layer = true
exclusive_classes = true
length = 262144

[components.textcat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
upstream = "transformer"
pooling = {"@layers": "reduce_mean.v1"}
grad_factor = 1.0

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[training]
train_corpus = "corpora.prepare"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 10
max_steps = 2000
eval_frequency = 100
frozen_components = []
annotating_components = []

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.00005
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-08
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 256
stop = 2048
compound = 1.001

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.score_weights]
cats_score = 1.0

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null

[initialize.components]
[initialize.tokenizer]

Make sure you have a GPU at your disposal and launch the training CLI command!

python -m spacy train config.cfg --output ./output --gpu-id 0

You will see the training start, and you can monitor the loss of the TextCategorizer component.

Just to be clear, what we are training here is the TextCategorizer component, a small neural network head that receives the document representation and learns to predict the correct label.

But we are also fine-tuning RoBERTa during this training. That means the RoBERTa weights are updated on the TREC dataset, so it learns how to represent input questions in a way that is more useful for classification.

Once the model is trained and saved, we can use it for inference!

import spacy

nlp = spacy.load("output/model-best")

doc = nlp("What's the capital of Italy?")
print(doc.cats)

The output should be something similar to the following:

{'LOC': 0.98, 'HUM': 0.01, 'NUM': 0.0, …}
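
To turn the score dictionary into a single prediction, we can simply take the highest-scoring label from the doc.cats of the snippet above:

# Pick the label with the highest score from doc.cats.
predicted_label = max(doc.cats, key=doc.cats.get)
print(predicted_label, doc.cats[predicted_label])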

Final Thoughts

To recap, in this post we saw how to:

  • Use a Hugging Face dataset with spaCy

  • Convert text classification data into the .spacy format
  • Configure a full pipeline using RoBERTa and textcat
  • Train and test the model using the spaCy CLI

This approach works for any short text classification task: emails, support tickets, product reviews, FAQs, and even chatbot intents.
