Semantic Versioning for Data Science Models
If you’ve ever wanted to tag your data science model, you’ve probably wondered how to version it. Which will it be: vx.4.1, v34.1231.51.21, or v91.x4.dev34? After reading about semantic versioning, I propose a method for versioning data science models.
Semantic Versioning for Software
Semantic versioning proposes the following:
Given a version number MAJOR.MINOR.PATCH, increment the:
- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards-compatible manner, and
- PATCH version when you make backwards-compatible bug fixes.
Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
Well, I don’t build APIs right now, but I think there is a way to apply this to the ways in which my software does change.
Semantic Versioning for Data Science Models
I propose that data science models use BETWEEN.WITHIN.PROCESS versioning, incrementing the
- BETWEEN version when you have an incompatible data change or target variable change, which render your models incomparable to prior versions, the
- WITHIN version when you improve or enhance a model output, and the
- PROCESS version when you update a pipeline but it doesn’t ultimately improve the model.
(Where BETWEEN implies ‘between model’ and WITHIN for ‘within model’.)
I build data science models by building python packages and committing the code to GitHub. The python package contains all of the support files and a
main.py file to run the pipeline from start to finish. The support files could contain either helpful loggers to tell me where the pipeline is breaking down, diagnostic tools such as an AUC-ROC plot, files to engineer features, or files to train different model types. So there are really these three things that could happen to my code at any time. It doesn’t matter where those changes occur, or how many lines of code changed. What matters is what’s happened to the model. Each time I get a pull request approved, I’ll update the version number in my repo.
Increment the BETWEEN version for the following changes:
- When the structure of your data changes
- If your target variable changes (how you coded it, or data that produced the target changed)
- The underlying population you’re training on changes
Assume I discovered that I had a bad join. The join change might have been a minor one, but it’s now difficult to really compare the two models because my target was affected as a result. What I was predicting, even if slightly different, is now something different. If you ever feel like you’re comparing apples to oranges when looking at an AUC-ROC curve, update the BETWEEN version.
With each BETWEEN change, there should be a clear communication in the release notes of why the model is inherently different from prior models. These changes should be less frequent.
Increment the WITHIN number when the following happens:
- Features are added
- Data sources are added or updated. (You might have several data sources today that help contribute to the target. But adding a data source might just mean adding new features. If you add data in such a way that it changes what you’re ultimately modeling, update the BETWEEN version.)
- New modeling types are added. (You might be using a logistic regression, but add a mo)
So if I added files that allowed me to train different model types, or added files that engineered new features, I’d incrememnt the WITHIN version. In my mind, this shouldn’t be a count of features you include, but should just be incremented every time you do something that affects the model performance.
At any time, you might what to view different subsets of features and their effect on the model. So, you might not change the number of features, but add some functionality into your pipeline that now produces three different models in each run. I would increment the WITHIN version number in that case.
Remember, the goal of this type of versioning is to show generally what’s changed in your pipeline.
Increment the PROCESS version when the following happens:
- A minor bug is fixed (unless this updates BETWEEN or WITHIN)
- Enhance a logger or aspect of the pipeline
- Add a diagnostic plot or table
Basically, any time you add something that doesn’t ultimately affect the performance of the model, you should update the PROCESS version. Your changes might improve the flow of your code and make you a much happier developer, but if it doesn’t improve the model, then the key priority of your code hasn’t improved.
Ultimately, data science semantic versioning can help communicate to team members what has fundamentally changed between software versions. Easily, a manager, colleague, or model governance officer could see the following version numbers and be able to know generally what’s changed:
0.0.1: Process improvement
0.1.0: Model improvement
0.2.0: Model improvement
1.0.0: Model change
1.1.0: Model improvement
1.1.1: Process improvement
So what changed?
0.0.1: Setting up the code base, but no model or features yet
0.1.0: Finished a model, or added some features
0.2.0: Added some features
1.0.0: Discovered a bug in the target variable, something was miscoded. What is being predicted now changes (but you still have all the old features!)
1.1.0: Added some new features
1.1.1: Added a new plot to assess the financial impact of the model, but model doesn’t inherently change.
This type of versioning also ultimately encourages a more disciplined workflow. Each sets of commit should be focused on one task alone. Because every time you update a version number, numbers to the right should refresh to zero. That’s how the theory goes, at least.