Git as a management tool for training data and experiments in ML
In this part of our series of articles on MLOps, we start with something that will be familiar to most of you: the basics of Git. To give a fresh perspective on this well-known tool, we use these basics to highlight what Git can do for machine learning (ML) and how managing training data differs from managing code.
Compared to conventional software development, different approaches emerge that are crucial when working with large amounts of data in ML. In this post, we will show you why a well-known tool like Git is often sufficient for managing both code and data.
In machine learning, we need code and data for training, and code for applying models. Step by step, we produce different versions of the trained model, often differing in small but important details. Gradually, quite a lot of versions and branches accumulate, and a big challenge is to manage these many versions and keep them available. This is the core of how Git is used differently in an ML project: instead of the largely linear development that is usual in conventional software engineering, in ML we branch off experimentally and then sometimes merge the results back - but not always. This is why data management and experiment management are so essential in ML, and in particular keeping an inventory of these experiment branches - we'll show you why in a moment. "Data as code" is a key term in this context: it expresses how important uniquely versioned training data is for managing your project with Git. You never know when you might need to go back to an old version of the data, for example because a new paper suddenly makes old data relevant again, or because you want to compare how a change in the data affects your model. With the unique link provided by the Git hash, you can easily return to every intermediate step and model variant. By saving the hash together with the training result, you always have an unambiguous reference to your training data - no version gets lost.
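To make this concrete, here is a minimal sketch in Python of what saving the hash with the training result can look like. The training step and the file names are hypothetical placeholders; the only Git-specific part is reading the current commit with `git rev-parse`:

```python
import json
import subprocess

def current_git_hash() -> str:
    """Return the commit hash of the currently checked-out revision."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

# Hypothetical training step - replace with your own code:
# model = train(data_dir="data/")
# model.save("model.bin")

# Store the hash next to the model artifact, so the exact code and
# data state used for this training run can always be recovered later.
with open("model_metadata.json", "w") as f:
    json.dump({"git_hash": current_git_hash()}, f, indent=2)
```

Because the training data lives in the same repository as the code, this single hash pins down both the code state and the data state of the run.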
Git for managing training data
In any normal ML development project, new data is constantly being added - images, text or other assets. With Git you can version all your folders together with the associated data - Git forgets nothing. As long as the volume stays moderate, Git is perfectly adequate for this. In ML, however, we sometimes have to deal with a lot of data and rapid changes, and here Git can reach its limits. As a rule of thumb: Git is sufficient for up to a few hundred MB of training data. In one of the next articles, we will show you how to handle even larger amounts of data, for example with network shares such as NAS (Network Attached Storage) or with Git extensions built to cope with large files.
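Whether you are still within this rule of thumb is easy to check. A small sketch, assuming your training data lives in a `data/` directory (adjust the path and the threshold to your project):

```python
from pathlib import Path

def dir_size_mb(path: str) -> float:
    """Total size of all files below `path`, in MB."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6

# "data/" and the 300 MB threshold are assumptions - adjust to your project.
size = dir_size_mb("data")
print(f"training data: {size:.1f} MB")
if size > 300:
    print("approaching Git's comfort zone - consider a NAS or a Git extension")
```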
Git for managing experiments
Normally, feature and bugfix branches are deleted a while after they have been merged into the main branch - they are usually no longer needed. So even though we keep branching off, the development history becomes "linear" again after each merge. In ML, we experiment far more often: we try things out, and a change may or may not be adopted. It is important, however, to archive discarded experiments as well, rather than deleting them completely. An idea may be picked up again months later, or you may at least want to keep the old state as a reference, to be able to understand why it didn't work.
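If your team prefers not to accumulate dozens of stale branches, one common pattern - a convention, not something prescribed by Git itself - is to convert a finished experiment branch into a tag before removing the branch; the commits stay reachable and the experiment can be revived at any time. A sketch, with a hypothetical branch name:

```python
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

branch = "experiment/gelu_activations"  # hypothetical experiment branch

# Preserve the final state of the experiment as a tag, then remove the
# branch. The tagged commits remain reachable and can be checked out or
# branched from again if the idea is picked up months later.
git("tag", f"archive/{branch}", branch)
git("push", "origin", f"archive/{branch}")
git("branch", "-D", branch)             # -D: the experiment was never merged
git("push", "origin", "--delete", branch)
```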
Since Git treats all branches the same in principle, it is an organisational matter for your team to mark which branches are to be kept and which can be deleted or archived, for example by giving them descriptive names (such as "experiment/gelu_activations"). Traditional feature branches, bugfix branches and the like are merged and then deleted, as usual in conventional software development. Branches containing experiments, on the other hand, should be kept for the long term, for the reasons mentioned above. Depending on how you host your central Git server (e.g. on GitHub, Bitbucket, GitLab or other solutions), repository settings or commit hooks can complement these organisational measures and ensure that experiment branches are preserved. In this way, Git can serve as an automatic backup of your training versions, so that nothing is lost.
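For a self-hosted Git server, a server-side pre-receive hook can enforce such a convention. The following is a minimal sketch in Python that rejects any push deleting a branch under the (assumed) `experiment/` prefix; on hosted platforms such as GitHub or GitLab, protected-branch settings achieve the same without a custom hook:

```python
#!/usr/bin/env python3
"""pre-receive hook: refuse to delete branches under refs/heads/experiment/."""
import sys

ZERO_SHA = "0" * 40  # Git signals a ref deletion with an all-zero new SHA

# Each stdin line has the form: <old-sha> <new-sha> <refname>
for line in sys.stdin:
    old_sha, new_sha, refname = line.split()
    if refname.startswith("refs/heads/experiment/") and new_sha == ZERO_SHA:
        print(f"rejected: {refname} is a long-term experiment, deletion is not allowed",
              file=sys.stderr)
        sys.exit(1)  # any non-zero exit rejects the entire push
```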
Texts, assets, logos - but also training data - all in Git
As we have just described, you can basically do everything in Git. It gives you an 80-20 solution that supports your work simply and efficiently, without having to pull more complex, additional tools into your workflow. Before you start thinking about heavier machinery, try out what plain Git already offers. Whether you train your model with text, images or other data, or compare inputs and outputs and train iteratively, Git can manage data just as well as code. Whether an image is simply a logo, as in traditional web development, or a training sample, makes no difference to Git - so why reach for a new tool? As always, it is important to use descriptive names so as not to jeopardise your organisational structure. Here are some examples from our daily work to help you decide when you can comfortably manage your training, validation and test data together with the associated code in Git - and when not:
- Transfer learning with texts: The advantage of modern pre-trained language models is that less and less data is needed for the actual adaptation to a specific task. The special training data for our NLP/NLG engine "Sokratext", for example, amounts to only a few MB of hand-curated text. It is most convenient to store it directly in Git together with the training code - we have been doing it this way for a long time and save a lot of time in our workflow.
- Transfer learning for an image classifier with a few hundred or thousand images: even a few hundred megabytes are no problem for Git in this case, as long as new data is added slowly. This way you know exactly which code state was trained with which data set.
- Training smaller neural networks on a few MB of time series data: again, the type of data doesn't matter. As long as your repo doesn't grow too much, a unique Git hash is the perfect reference for tracking which data a model was trained with (see the sketch after this list).
- However, if the amount of data grows too large, e.g. for extensive pretraining on Wikipedia dumps, large web-scraping data sets or video data, then Git reaches its limits.
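Within those limits, going back to the exact data a model was trained with is trivial. Assuming you stored the commit hash alongside the model as sketched earlier (the file name `model_metadata.json` is our hypothetical convention), restoring that state looks like this:

```python
import json
import subprocess

# Read the hash stored alongside the trained model (see the earlier
# sketch; the file name is a hypothetical convention).
with open("model_metadata.json") as f:
    git_hash = json.load(f)["git_hash"]

# Check out code *and* data exactly as they were at training time.
# This leaves HEAD detached; create a branch from here if you want to
# continue working on this state.
subprocess.run(["git", "checkout", git_hash], check=True)
```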
In practice, you often don't need an additional tool that you have to learn (and possibly pay for) to do ML. That is an enormous time saver, since you already know all the tricks and moves for managing code - and can simply apply them to data. Perhaps this overview will help you in your experiments and in assessing when Git is a good data management option for your project. How has Git helped you in your development, and where has it fallen short? Feel free to send us your feedback - we'd love to hear about your experiences! You can also look forward to the upcoming articles: among other things, we will take a closer look at DJL as a framework and at how you can integrate Maven into the development process.