
An attempt at compiling GCC in Astral

One of the major goals of most OSDev projects is to be self-hosting. There isn't a single definition of the term: it can range from simply having the kernel be able to compile itself, to having it be able to run all of the tools necessary to modify its source, compile it and put that modified copy on disk or upload it to the internet.

Currently, Astral is able to do the latter. However, looking at all the components that make that possible, just doing that isn't enough. What good is it for a kernel to be able to build itself, if it can't build its own compiler? For Astral to be self-hosting, it will need to be able to compile its compiler, which in this case is GCC.

Configuring

The main ./configure --target=x86_64-astral --build=x86_64-astral --host=x86_64-astral --enable-languages=c,c++ --disable-bootstrap command ran just fine, so in this section I will also talk about the other configure scripts GCC calls as a part of the build (after running make), as those ran into a few issues.

The first issue I encountered was with mmap. The unmapping code in the VMM had a logic error that left some pages with the wrong refcount, and the checking for working mmap... test caught that. It was a simple issue that didn't take long to find and fix.

The second issue presented its symptoms during the build. However, it was happening because of something related to the configure results. The errno variable was being redefined in a libiberty .c file, which caused the compiler to error out. After inspecting the source file and the output from the configure script, I noticed this concerning line: checking for ANSI C header files... no.

After double checking the system include directory to make sure the headers hadn't vanished, I decided to look into what was causing this check to fail. After trying commands for a while, I eventually noticed that a specific invocation of grep used in the configure script caused an mlibc panic because the splice function was unimplemented. After commenting out the splice call in the grep source code and recompiling it, configure was able to correctly detect the header files and continue with the build.

The third and last issue I ran into with the configure scripts was one in the checking dependency style of x86_64-astral-g++ -std=c++11... check. Looking at the config.log file, it appeared that I had a memory leak, as GCC was returning Out of memory (ENOMEM) errors while including files.

That made me go looking for leaks in the kernel (without even checking if there was a leak in the first place), so I spent a few hours writing a simple tool to track allocations done by the alloc.c kernel file and testing it out. I was able to find two nasty leaks in the ext2 driver and a small one in the hashtable implementation. However, after fixing those and trying to run it again... The same error happened.
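The tracking tool itself is not shown in this post. As a rough sketch of the idea, and not Astral's actual code, a tracker can wrap the allocator and remember every live pointer together with its call site; whatever is still in the table once a workload finishes is a leak candidate. All names here (sk_tracked_alloc, sk_tracked_free, SK_ALLOC) are hypothetical, and malloc stands in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_MAX_TRACKED 1024

/* One live allocation and the call site that made it. */
struct sk_record {
    void *ptr;
    size_t size;
    const char *file;
    int line;
};

static struct sk_record sk_live[SK_MAX_TRACKED];

/* Allocate and record the pointer in the first free table slot.
 * If the table is full, the allocation still succeeds but is untracked. */
static void *sk_tracked_alloc(size_t size, const char *file, int line) {
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < SK_MAX_TRACKED; i++) {
        if (sk_live[i].ptr == NULL) {
            sk_live[i] = (struct sk_record){ p, size, file, line };
            break;
        }
    }
    return p;
}

/* Forget the pointer, then release it. */
static void sk_tracked_free(void *p) {
    if (p == NULL)
        return;
    for (size_t i = 0; i < SK_MAX_TRACKED; i++) {
        if (sk_live[i].ptr == p) {
            sk_live[i].ptr = NULL;
            break;
        }
    }
    free(p);
}

/* Number of allocations that have not been freed yet. */
static size_t sk_live_count(void) {
    size_t n = 0;
    for (size_t i = 0; i < SK_MAX_TRACKED; i++) {
        if (sk_live[i].ptr != NULL)
            n++;
    }
    return n;
}

/* Records the caller's file and line automatically. */
#define SK_ALLOC(size) sk_tracked_alloc((size), __FILE__, __LINE__)
```

Dumping the file and line of every remaining record after unmounting a filesystem, for example, would point straight at the code path that forgot to free.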

Checking the available system memory using GDB showed that there was over 7 GB available, so it couldn't possibly have been a memory leak that was causing this. Then I remembered a specific detail of my VMM: it refuses to map 0 sized ranges of memory. GCC was doing a 0 byte read of some file and, because of that, the vmm_map call in syscall_read was failing, which caused it to return an out of memory error to userspace. After fixing that, the error went away and the build continued.
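The exact change is not shown here, but one plausible shape of the fix is to handle the zero-length case before any mapping is attempted. POSIX specifies that a read of 0 bytes returns 0 when no error is detected, so returning early is safe. The sketch below is a userspace model under stated assumptions: sk_vmm_map, sk_vmm_unmap and sk_read are hypothetical stand-ins, not Astral's real vmm_map or syscall_read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel mapping routine: like the VMM
 * described above, it refuses zero-sized ranges. */
static void *sk_vmm_map(size_t size) {
    if (size == 0)
        return NULL;
    return malloc(size);
}

static void sk_vmm_unmap(void *p) {
    free(p);
}

/* Hypothetical read path over an in-memory "file".
 * Returns the number of bytes read, or a negative errno value. */
static long sk_read(const char *file, size_t file_len, size_t offset,
                    void *user_buf, size_t count) {
    /* The fix: a zero-byte read succeeds without mapping anything.
     * Without this check, sk_vmm_map(0) fails and the caller sees ENOMEM. */
    if (count == 0)
        return 0;

    void *kbuf = sk_vmm_map(count);
    if (kbuf == NULL)
        return -ENOMEM;

    /* Clamp the request to what is left in the file. */
    size_t avail = offset < file_len ? file_len - offset : 0;
    size_t n = count < avail ? count : avail;
    if (n > 0) {
        memcpy(kbuf, file + offset, n);
        memcpy(user_buf, kbuf, n);
    }

    sk_vmm_unmap(kbuf);
    return (long)n;
}
```

The alternative would be to make the mapping routine accept empty ranges, but an early return keeps the VMM's invariant intact and avoids a pointless map and unmap.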

Building

The actual build was more about increasing Astral's stability and waiting a long time. Linking cc1, for example, took over an hour. During this process, I was able to fix a few long-standing instability bugs in the kernel, which is what I will talk about in this section.

The first bug was an annoying issue that happened with the virtio-block driver. It happened rarely enough for it to be hard to reproduce but often enough for it to interrupt you at the worst possible moments. Something was causing the blkdev->queuewaiting[buffidx] != NULL assert to fail, which happens when a response is received from the device but there is no thread waiting for it. My goal for that day was to finally fix this.

After poking around with GDB and rereading the code for the driver a few times, I noticed a race happening between the device and the thread enqueuing a request: the thread would increment the driver queue index, but after the index was incremented and before the data for the request was actually set up, the device would read the queue. This caused the response to be wrong and made the assert fail. The reason this happened so rarely was that it needed two threads to enqueue a request at the same time, which didn't happen often under a normal load.
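The driver change is not shown in this post. The general rule for this kind of race, which the virtio specification also requires of drivers, is to fill in the ring entry first and only then publish the new index, with a memory barrier in between. The following is a minimal sketch of that ordering using C11 atomics; the sk_queue type and its functions are hypothetical, the device is modelled by an ordinary function, and the caller is assumed to hold a lock that serializes enqueuers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_QUEUE_SIZE 8

/* Toy model of a driver-to-device ring. */
struct sk_queue {
    uint16_t ring[SK_QUEUE_SIZE]; /* entries the device reads */
    _Atomic uint16_t idx;         /* index the device polls   */
};

/* Enqueue one request. Assumes the caller holds a lock (not shown)
 * so that only one thread runs this at a time. */
static void sk_enqueue(struct sk_queue *q, uint16_t desc_head) {
    uint16_t idx = atomic_load_explicit(&q->idx, memory_order_relaxed);

    /* 1. Set up the entry while the device cannot see it yet. */
    q->ring[idx % SK_QUEUE_SIZE] = desc_head;

    /* 2. Publish the new index. The release store guarantees the
     *    entry written above is visible before the index changes. */
    atomic_store_explicit(&q->idx, (uint16_t)(idx + 1),
                          memory_order_release);
}

/* Stand-in for the device: consume one entry if a new one exists.
 * Returns 1 and stores the entry in *out, or 0 if the ring is idle. */
static int sk_device_poll(struct sk_queue *q, uint16_t *last_seen,
                          uint16_t *out) {
    uint16_t idx = atomic_load_explicit(&q->idx, memory_order_acquire);
    if (*last_seen == idx)
        return 0;
    *out = q->ring[*last_seen % SK_QUEUE_SIZE];
    (*last_seen)++;
    return 1;
}
```

With the two steps in the opposite order, a device that polls between them reads a slot that has not been written yet, which matches the symptom described above.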

The other issue was a double free in the ext2 filesystem code, which happened because I wasn't zeroing out the intermediate indirect blocks properly. This one was also difficult to reproduce under normal load, but when compiling GCC I was able to consistently reproduce it in the middle of linking cc1.
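To illustrate why the zeroing matters, here is a toy model, not the Astral ext2 driver: a freshly allocated indirect block that still holds stale data looks like a table of valid block pointers, so freeing the file later "frees" blocks it never owned. The sk_disk structure and both functions are hypothetical, and the block size is shrunk to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_BLOCK_SIZE 64
#define SK_NBLOCKS 16
#define SK_PTRS_PER_BLOCK (SK_BLOCK_SIZE / sizeof(uint32_t))

/* Toy in-memory "disk". Block number 0 means "no block", as in ext2. */
struct sk_disk {
    uint8_t data[SK_NBLOCKS][SK_BLOCK_SIZE];
    uint8_t used[SK_NBLOCKS]; /* simplified block bitmap */
};

/* Allocate a block to serve as an indirect block.
 * Returns the block number, or 0 if the disk is full. */
static uint32_t sk_alloc_indirect(struct sk_disk *d) {
    for (uint32_t b = 1; b < SK_NBLOCKS; b++) {
        if (!d->used[b]) {
            d->used[b] = 1;
            /* The important step: clear stale contents so that every
             * slot reads as "no block" until it is really assigned. */
            memset(d->data[b], 0, SK_BLOCK_SIZE);
            return b;
        }
    }
    return 0;
}

/* Free every block an indirect block points to, then the block itself.
 * Returns how many of those frees hit an already-free block. */
static int sk_free_indirect(struct sk_disk *d, uint32_t ind) {
    uint32_t ptrs[SK_PTRS_PER_BLOCK];
    int double_frees = 0;

    memcpy(ptrs, d->data[ind], SK_BLOCK_SIZE);
    for (size_t i = 0; i < SK_PTRS_PER_BLOCK; i++) {
        uint32_t b = ptrs[i];
        if (b == 0 || b >= SK_NBLOCKS)
            continue;
        if (!d->used[b])
            double_frees++;
        d->used[b] = 0;
    }
    d->used[ind] = 0;
    return double_frees;
}
```

Removing the memset from sk_alloc_indirect and filling the disk with leftover pointers beforehand makes sk_free_indirect report double frees, which is the failure mode described above in miniature.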

The biggest issue I ran into, which is also preventing me from continuing, is a segfault in mlibc. More specifically, when x86_64-astral-g++ is compiling parser.cc, mlibc unmaps some memory currently in use by its allocator to store metadata, which then causes an access to invalid memory when the application calls malloc. I still don't know if this is an mlibc issue or an Astral issue and, due to my almost nonexistent knowledge of the internals of mlibc, I am not able to debug this properly.

What now?

Until that bug is fixed, the GCC build cannot continue. I am very confident that this is one of the last few things remaining to do, though. And, once this issue is resolved, I believe there is a high chance of the build finishing without too many issues. Once that happens, I would love to try to do something like an Astral from scratch, where I would build a working Astral distribution inside itself.