Quick software tips for new ML researchers
Just a quick set of tips for new ML researchers working in Python who are likely self-taught and haven't had a mentor to guide them on best practices. They're small, easy, and will generally improve your productivity and level of professionalism. I wrote this up for a class intended to teach DL to engineers without an ML or software background and thought I'd share it.
- If you're running experiments with varied hyperparameters and settings that you control through configs, either use an existing config system or be willing to build one that's quite good. Config systems let you easily control all of the parameters that define an experiment without having to edit a file for each one, and they come with a host of additional nice features. Some nice ones I've used are:
- Pyrallis
- Hydra
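For a sense of what these buy you, here's a minimal sketch of the dataclass-style configs both libraries build on (the field names here are invented for illustration); Pyrallis and Hydra then layer CLI overrides and YAML loading on top of structures like this:

```python
from dataclasses import dataclass, field

@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    weight_decay: float = 0.0

@dataclass
class TrainConfig:
    seed: int = 0
    epochs: int = 10
    # nest sub-configs so related parameters stay grouped
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

# every experiment is fully described by one typed object,
# instead of constants scattered across your training script
cfg = TrainConfig(epochs=5)
print(cfg.optimizer.lr)
```

With a config system, that last line becomes a command-line override instead of an edit to the source file.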
- Use a package manager instead of installing all of your packages into a single base environment. Package managers let you construct isolated Python environments, which is helpful because not all versions of Python libraries are compatible with each other. They're also a godsend when, months later, you need to rebuild your environment for one reason or another, since they give you a reproducible way to construct it. Pick one, whether it's Conda, uv, Poetry, or pixi. Any time you want to add a package to your codebase, don't just `pip install` it; add it to the environment file as well. You'll be happier for it later.
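As one example of what "the environment file" looks like, here's a minimal Conda `environment.yml` (package names and versions are placeholders for illustration):

```yaml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pip
  - pip:
      # PyPI-only packages go here
      - some-pypi-only-package
```

Months later, `conda env create -f environment.yml` rebuilds the whole thing in one command.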
- Don't just develop locally; make sure you're frequently pushing to GitHub so your code doesn't get lost. As a side benefit, if your experiments stop working at some point, you can use `git bisect` to find the exact commit at which they broke in a logarithmic number of trials!
- Use a linter so that your code is always nice and clean. I like `ruff` because it's fast. You can set up automatic linting with pre-commit so that when you try to commit code, it instantly lints for you and won't let you commit unclean code (which you will often be tempted to do).
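A `.pre-commit-config.yaml` that runs ruff looks roughly like this (the `rev` pin below is illustrative; check the ruff-pre-commit repo for the current release):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.0  # pin to the latest tagged release
    hooks:
      - id: ruff         # lint
      - id: ruff-format  # format
```

Then `pip install pre-commit` and `pre-commit install` once per clone, and every `git commit` gets checked automatically.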
- Related to the above point, use git and commit frequently. When developing new features, do so on a branch and merge it into your main branch when it's ready. For solo projects this maybe isn't always necessary, but it's not a terrible habit to get into, as eventually you'll collaborate with someone.
- Related to the above point, I've observed that students frequently like Jupyter notebooks (I personally don't like them because it's hard to use a debugger inside them). Notebooks are hell on git because each small change rewrites tens or hundreds of lines of the underlying notebook file, making the commits really hard to read later. To keep your git history manageable, I recommend using something like `jupytext` with Jupyter Lab. `jupytext` saves your notebooks as plain scripts or Markdown, which keeps your commits nice and tidy. I'll say, though, that this is the point I'm least insistent on. You could also try Marimo, which I've heard is really cool.
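For reference, a notebook round-tripped through jupytext's percent format is just a plain Python file with cell markers, which diffs cleanly in git (the cell contents here are invented for illustration):

```python
# %% [markdown]
# # Learning-rate sweep scratchpad

# %%
lrs = [1e-4, 3e-4, 1e-3]

# %%
# each "# %%" line starts a new notebook cell
best_lr = min(lrs)  # placeholder for a real selection criterion
print(best_lr)
```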
- If you have a SLURM cluster, set up a launcher that'll write SBATCH scripts for you. I personally like submitit, but I expect there are other tools out there that might be even better.
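If you'd rather roll your own than use submitit, the core of such a launcher is just templating: a minimal sketch (the template fields and `train.py` entry point are placeholders for whatever your jobs need):

```python
from pathlib import Path

SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --gres=gpu:{gpus}
#SBATCH --time={hours}:00:00
#SBATCH --output=logs/{job_name}_%j.out

python train.py {args}
"""

def write_sbatch(job_name: str, args: str, gpus: int = 1, hours: int = 24) -> Path:
    """Render an sbatch script for one experiment and write it to disk."""
    script = SBATCH_TEMPLATE.format(
        job_name=job_name, gpus=gpus, hours=hours, args=args
    )
    path = Path(f"{job_name}.sbatch")
    path.write_text(script)
    return path

path = write_sbatch("lr_sweep_0", "--lr 3e-4")
# then submit it, e.g. subprocess.run(["sbatch", str(path)])
```

A loop over your hyperparameter grid calling `write_sbatch` gives you one script per experiment, with no hand-editing.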
- If you have a SLURM cluster, don't run experiments on your desktop! It's fine to do dev work on your desktop, but ML is all about throughput, and every experiment you run on your desktop instead of on many cluster nodes bottlenecks your throughput.
- Assuming you're doing ML, don't tune your hyperparameters by hand. Use a tuning library, or write your own script that does some kind of random or Bayesian search. Wandb has an integration for this, or you can use Optuna or Ray Tune or any of a million other ones, or just roll your own. Your skill as a researcher is in figuring out what should be tuned and reasonable ranges for it, but you weren't born to tune hyperparameters.
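The roll-your-own version is genuinely small. Here's a minimal random-search sketch (the hyperparameter names and ranges are made up; `evaluate` stands in for a real training run returning, say, validation accuracy):

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Sample one hyperparameter setting from hand-chosen ranges."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),          # log-uniform learning rate
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(evaluate, n_trials: int = 20, seed: int = 0):
    """Return the best (score, config) found over n_trials random samples."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)  # in practice: launch a run, read back a metric
        if best is None or score > best[0]:
            best = (score, cfg)
    return best

# toy objective standing in for a real training run
score, cfg = random_search(lambda c: -abs(c["lr"] - 1e-3))
```

Your research judgment goes into `sample_config`; the loop itself is boilerplate the libraries above handle for you, along with parallelism and smarter (Bayesian) sampling.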
- Please don't do string checks like `if variable == 'some_string':` to configure your experiments. This is super error-prone and will almost certainly trip you up at some point. Instead, you can use Enums or Structured Configs in Hydra. I'll admit this one is just me being nit-picky, but it's such an opportunity for unnecessary bugs.
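The Enum version of the same dispatch is barely longer and fails loudly on typos (the optimizer names and learning rates here are illustrative):

```python
from enum import Enum

class Optimizer(str, Enum):
    ADAM = "adam"
    SGD = "sgd"

def learning_rate_for(opt: Optimizer) -> float:
    # dispatch on enum members, not raw strings
    if opt is Optimizer.ADAM:
        return 3e-4
    if opt is Optimizer.SGD:
        return 1e-1
    raise ValueError(f"unhandled optimizer: {opt}")

# converting from config input validates it up front:
# Optimizer("adamw") would raise ValueError immediately,
# whereas variable == "adamw" would silently never match
opt = Optimizer("adam")
print(learning_rate_for(opt))
```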
Finally, I should note that I'm not a professional software engineer! These are things that work for me and my students, and I'm sure some of them will make other engineers or researchers scoff.