I’m not being lazy, the bloody piece of shite is compiling
Today was a day of deep learning, which is to say that nothing, but absolutely nothing worked as it was supposed to and it took ages to figure out why.
As of last Friday my project is maintaining two trunks of a rapidly diverging code base - one designated support, for the current generation of hardware, and master, for whatever is coming next.
Adapting our CI was as simple as cloning the build job pipeline and telling git to check out a different branch. And that worked for about 2 builds, at which point the first gatekeeper, which checks the integration of device and host PC, started failing for the support builds.
The error came from our host tools, which are written in C#, and it provoked a big, loud “WAT?”, as per the initial stages of debugging: it was an “I can’t load type so-and-so in assembly so-and-so” for a type that was definitely nowhere to be found in the sources.
Here I have to digress and explain that we are exposing part of the C# .NET functionality over COM for use in Ruby scripts, and for that reason the system registers a couple of assemblies using regasm in the test stages.
Now, the system includes 3 C/C++ compilers, Python, Tcl, Ruby and a whole bunch of DSL generation steps in addition to the msbuild invocations, so for convenience and speed the clean step simply removes the directory where the build artifacts are saved.
I am a big fan of out-of-band builds (builds that do not mix sources with artifacts) and the fast, permanent and - up until now - reliable clean function is one of my favorite features.
Trust msbuild to find a way to work around this. So what happened?
Cause
The culprit is called IntermediateOutputPath and is a setting in Visual Studio project files that tells Visual Studio where to place the object files created by its compiler before linking. It’s that obj directory that pops up whenever you press F7.
There is no way to set this in the project properties as far as I know. You have to edit the .csproj file and add it, which is what we do (like I said, out-of-band builds).
It seems that under certain circumstances (which I have not been able to recreate yet) Visual Studio takes it upon itself to change the value of this parameter.
It then sets it to a random temporary directory using ENV[TEMP] as the basis and puts that value in the .csproj file, which obviously gets committed.
And this happened at an indeterminate point in the past, before we ever had multiple trunks.
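A check like the following could have caught the rewritten setting before it was committed. It is only a minimal sketch: it assumes the project convention is that every IntermediateOutputPath starts with a known build-root prefix, and the `$(BuildRoot)` property and the function names are made up for illustration, not taken from our actual build.

```python
import xml.etree.ElementTree as ET


def intermediate_output_paths(csproj_xml):
    """Return every IntermediateOutputPath value found in a .csproj document."""
    root = ET.fromstring(csproj_xml)
    values = []
    for element in root.iter():
        # Old-style project files carry an XML namespace, so compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "IntermediateOutputPath" and element.text:
            values.append(element.text.strip())
    return values


def suspicious_paths(csproj_xml, expected_prefix):
    """Return the values that do not start with the expected build-root prefix.

    expected_prefix is a hypothetical project convention, e.g. "$(BuildRoot)".
    """
    prefix = expected_prefix.lower()
    return [path for path in intermediate_output_paths(csproj_xml)
            if not path.lower().startswith(prefix)]
```

Run over every .csproj in the repository as an early CI step (or a pre-commit hook), a non-empty result from `suspicious_paths` would fail the build instead of letting the object files wander off into a shared temporary directory.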
Effect
Switch now to the main build server, which, in the alleged security of out-of-band builds, attempts to build both trunks.
Given a team that commits rapidly on the master branch clearing technical debt and sparingly on the support branch, you will very quickly reach the point where the sources on the support branch are by definition older than the corresponding object files in said “temporary” directory.
But msbuild doesn’t know that. It sees newer object files and happily goes “I don’t have to build this, hihi” and links some random piece of code for you. And since you’re an idiot and forgot to de-register the assembly before building, any type mismatches are not reported by the linker. Hilarity ensues.
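The failure mode is easy to reproduce in miniature. The sketch below is a simplified model of a timestamp-only up-to-date check, not msbuild's actual incremental build logic; the file names and the two directories are invented for the demonstration.

```python
import os
import tempfile
import time


def is_up_to_date(source, artifact):
    """Timestamp-only check: the artifact counts as current if it is not older than the source."""
    return os.path.exists(artifact) and os.path.getmtime(artifact) >= os.path.getmtime(source)


def demo():
    """Simulate a support checkout meeting an object file left behind by a master build."""
    with tempfile.TemporaryDirectory() as shared_obj_dir, \
            tempfile.TemporaryDirectory() as support_checkout:
        source = os.path.join(support_checkout, "Tool.cs")    # hypothetical source file
        artifact = os.path.join(shared_obj_dir, "Tool.dll")   # hypothetical intermediate output
        with open(source, "w") as handle:
            handle.write("support branch code")
        with open(artifact, "w") as handle:
            handle.write("built from master branch code")
        now = time.time()
        os.utime(source, (now - 3600, now - 3600))  # support sources change rarely: old timestamp
        os.utime(artifact, (now, now))              # master was built a moment ago: new timestamp
        # The check only compares timestamps, so it cannot tell the artifact
        # was produced from a different branch.
        return is_up_to_date(source, artifact)
```

Here `demo()` reports the artifact as up to date even though it was built from different code, which is exactly the trap: the check has no idea where the object file came from. Keeping intermediate output inside the directory that the clean step deletes removes the shared state that makes this possible.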
Lessons learned
- This is a fine example of the evils of shared state.
- Not one tool vendor has really thought about building things in parallel.
- Always do clean builds on your CI.
- Containers are nice, but…Windows.
- Never assume the tool knows better and never assume you cleaned up everything.
- Visual Studio is evil.
- There is always something you missed. Have multiple levels of testing.