There are many challenges when it comes to running machine learning models in production. These range from reproducibility and versioning to secure serialization. In this blog post, I'll walk you through a library called skops, which tackles these challenges.
We will go through an end-to-end example: train a model, serialize it, document it, and host it.
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from datasets import load_dataset
# Load the data and split it into train and test sets
data = load_dataset("scikit-learn/breast-cancer-wisconsin")
df = data["train"].to_pandas()
y = df["diagnosis"]
X = df.drop("diagnosis", axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Impute missing values, scale the features, then fit a logistic regression
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)
We will now save the model. We can save the model using any format, including joblib, pickle, or skops.
skops introduces a new serialization format. The motivation is to avoid using pickle or joblib to serialize sklearn models. Deserializing a pickle or joblib file can let bad actors execute code on your local machine. Pickle is a serialization protocol that stores instructions for your code in a binary format, so it is not human-readable, and those instructions can do almost anything: remove everything on your machine or install malware. You should only deserialize a pickle that comes from a source you trust.
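To make the risk concrete, here is a minimal sketch of how a pickle can run a callable of its author's choosing at load time. The `Payload` class is a hypothetical stand-in written for this illustration, and the call it smuggles in is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a malicious object."""

    def __reduce__(self):
        # pickle records "call this function with these arguments" and
        # performs that call when the bytes are loaded. A harmless eval of
        # an arithmetic expression is used here; an attacker could record
        # a call that deletes files or downloads malware instead.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload instance. It executes the recorded call
# and returns whatever that call produced.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Nothing in the byte string warns the person loading it, which is why the only safe policy for pickle files is to load them from trusted sources.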
The serialization format introduced by skops doesn't rely on pickle and lets users see what a given file contains without loading it. You can read more about it in the skops documentation. Let's take a look at the API.
You can save a sklearn model or pipeline by passing the object and the file path it will be saved to.
import skops.io as sio
sio.dump(pipe, "pipeline.skops")
The tricky part is loading the model from a file. We pass the file path to load, along with one more parameter called trusted, which can be True, False, or a list of trusted types. If set to False, only the trusted types are loaded. Let's take a look.
sio.load("pipeline.skops", trusted=True)
# result
Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('model', LogisticRegression())])
We can get a list of untrusted types using get_untrusted_types.
unknown_types = sio.get_untrusted_types(file="pipeline.skops")
print(unknown_types)
# output
['numpy.int64']
You can directly pass the above list to trusted.
If you try to load the file without trusting the type above, loading will fail with an UntrustedTypesFoundException.
# output
UntrustedTypesFoundException: Untrusted types found in the file: ['numpy.int64'].
Note that you always have to pass something to trusted, as this prompts the user to decide whether or not to trust the file.
# output
UntrustedTypesFoundException: Untrusted types found in the file: ['numpy.int64'].
If you'd like to host models that are open to everyone, you can do that with skops and Hugging Face Hub. This enables easy inference without downloading the model, model documentation in the model repository, and building interfaces with one line of code.
Hugging Face Model Repository
This is built with one line of code
Let's see how to create these programmatically.
hub_utils.init creates a local folder at the given path containing the model and a configuration file. The configuration file holds the requirements of the environment the model was trained in, the training objective, a sample from the dataset, and more. The sample data and the task identifier passed to init help Hugging Face Hub enable the inference widget on the model page, as well as the discoverability features used to find the model.
Note: The inference widget, inference API, and gradio integration only work with the pickle format for now. We're currently developing support for the skops format. Therefore, we will save the model in pickle for now.
import pickle
from skops import hub_utils
# let's save the model
model_path = "example.pkl"
local_repo = "my-awesome-model"
with open(model_path, mode="bw") as f:
    pickle.dump(pipe, file=f)
# we will now initialize a local repository
hub_utils.init(
    model=model_path,
    requirements=[f"scikit-learn={sklearn.__version__}"],
    dst=local_repo,
    task="tabular-classification",
    data=X_test,
)
The repository now contains the model and the configuration file that enables inference, building an environment to load the model, and more. The configuration file is a JSON file that contains:
a small sample of the dataset,
columns of the dataset,
the environment requirements to load the model,
relative path to the model file inside the repository,
the task that's being solved.
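Since the configuration is plain JSON, it can be read with the standard library. The sketch below runs against a hand-written stand-in file: the key names (`sklearn`, `columns`, `environment`, `example_input`, `model`, `task`) are an assumption about the layout, and every value is a placeholder invented for the illustration. `summarize_config` is a hypothetical helper, not part of skops.

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in for the generated file. Key names are assumed and
# all values are placeholders, not real output.
sample_config = {
    "sklearn": {
        "columns": ["radius_mean", "texture_mean"],
        "environment": ["scikit-learn=1.1.1"],
        "example_input": {"radius_mean": [1.0, 2.0], "texture_mean": [3.0, 4.0]},
        "model": {"file": "example.pkl"},
        "task": "tabular-classification",
    }
}


def summarize_config(config_path):
    """Hypothetical helper: pull out the fields listed in the text above."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))["sklearn"]
    return {
        "task": config["task"],
        "model_file": config["model"]["file"],
        "requirements": config["environment"],
        "n_columns": len(config["columns"]),
    }


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(sample_config), encoding="utf-8")
    summary = summarize_config(config_path)

print(summary)
```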
Now, we will document our model by creating a model card. The model cards in skops follow the format of Hugging Face Hub model cards: each consists of a markdown part and a YAML metadata section. You can check out the keys of the metadata section here for better discoverability of the models. The model card follows a template that consists of:
a YAML section on top for metadata (task ID, license, library name used for training, and more),
a free-text section in markdown format with sections to be filled in (e.g. description of the model, intended use, limitations, and more).
The following sections of the model card are automatically generated by skops:
Hyperparameters of the model,
Interactive diagram of the model,
A small snippet that shows how to load and use the model,
For metadata, the library name, the task identifier (e.g. tabular-classification), and the information required by the inference widget are filled in.
skops allows programmatic editing of the model card through various methods. The documentation on the card module and the default template provided by skops is here.
You can instantiate the Card class from skops to create the model card. This class is an intermediate data structure that is later rendered to markdown. We will later save this card to the repository where the model will be hosted. During the initialization of the repository, the task name (e.g. tabular-regression) and library name (e.g. scikit-learn) are written to the configuration file. Task and library names are also needed in the card's metadata, so you can use the metadata_from_config method to extract the metadata from the configuration file and pass it to the card when you create it. You can use the add method to add information and edit metadata.
from pathlib import Path
from skops import card
# create the card
model_card = card.Card(pipe, metadata=card.metadata_from_config(Path(local_repo)))
limitations = "This model is not ready to be used in production."
model_description = (
    "This is a LogisticRegression model trained on breast cancer dataset."
)
# add information to the model card
model_card.add(**{
    "Model description": model_description,
    "Model description/Intended uses & limitations": limitations,
})
# set the license in the metadata
model_card.metadata.license = "mit"
We can evaluate the model and write the results to the model card as metrics. The add_metrics method adds metrics to our model card and writes them as a table.
from sklearn.metrics import (ConfusionMatrixDisplay, confusion_matrix,
                             accuracy_score, f1_score)
# let's make a prediction and evaluate the model
y_pred = pipe.predict(X_test)
# we can pass metrics using add_metrics and pass details with add
model_card.add_metrics(accuracy=accuracy_score(y_test, y_pred))
model_card.add_metrics(**{"f1 score": f1_score(y_test, y_pred, average="micro")})
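One thing worth knowing when reading the resulting table: for a single-label problem like this one, the micro-averaged F1 score is equal to accuracy, because every misclassified sample counts once as a false positive and once as a false negative. The plain-Python sketch below checks this on a few made-up labels. The helper functions are written for this illustration and are not the scikit-learn implementations.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # wrongly predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    total_tp = sum(tp.values())
    precision = total_tp / (total_tp + sum(fp.values()))
    recall = total_tp / (total_tp + sum(fn.values()))
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
y_true_demo = ["M", "B", "B", "M", "B", "B"]
y_pred_demo = ["M", "B", "M", "M", "B", "B"]

acc = accuracy(y_true_demo, y_pred_demo)
f1 = micro_f1(y_true_demo, y_pred_demo)
print(acc, f1, math.isclose(acc, f1))
```

If you want an F1 value that carries information beyond accuracy, a per-class or macro average is the usual alternative.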
Plots that visualize model performance can be added using add_plot.
import matplotlib.pyplot as plt
# we will create a confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=pipe.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe.classes_)
disp.plot()
# save the plot inside the local repository
plt.savefig(Path(local_repo) / "confusion_matrix.png")
# the plot will be written to the model card under the name "Confusion Matrix"
# we pass the path of the plot, relative to the repository root
model_card.add_plot(**{
    "Confusion Matrix": "confusion_matrix.png"})
Let's save the model card in the local repository. The file name here should be README.md since that is what Hugging Face Hub expects.
The repository is now ready to be pushed to Hugging Face Hub. We can use hub_utils for this. Hugging Face Hub follows an authentication flow with tokens, so we can pass our token to push.
hub_utils.push(
    repo_id="scikit-learn/blog-example",
    source=local_repo,
    commit_message="pushing files to the repo from the example!",
    create_remote=True,
)
Once the model is on the Hugging Face Hub, anyone can download it using download, unless the model is private. The repository contains the model, the model card, and the model configuration, which includes a small sample of the dataset for reproducibility, the requirements, and more.
hub_utils.download(repo_id="scikit-learn/blog-example", dst="downloaded-model")
The model can be easily tested using the inference widget.
Inference Widget in Repository
We can now use the gradio integration for skops. We've created the interface below with only one line of code! 🤩
Gradio UI for our model
import gradio as gr
gr.Interface.load("huggingface/scikit-learn/skops-blog-example").launch()
We can further customize this UI like the following:
We can pass title, description, and more to the loaded UI. Check out the gradio documentation on the Interface class for more information on what you can customize.
gr.Interface.load("huggingface/scikit-learn/blog-example",
                  title="Logistic Regression on Breast Cancer").launch()
The resulting repository is here.
Further Resources
Merve Noyan is a Google developer expert on machine learning and developer advocate at Hugging Face.