Now that we’ve seen what can go wrong when running concurrent code, it’s time to discuss how to actually write concurrent code and what’s happening in the background. In order to do that efficiently, I need to assume some basic knowledge of computing fundamentals that you’d typically learn in Computer Science I or II. I’ll link to Wikipedia articles for those interested in a refresher.

If we were able to stop a program at any instant of time and look to see what the computer is storing, we’d notice the following. The computer stores a program counter, which we can think of as a pointer to the current line of code being executed. As execution of the program proceeds, the program counter is incremented line by line. It also stores the current values of all local variables in stack memory. Variables not on the stack are stored in a special location on the heap. As the execution of our program progresses, the current values of variables are updated on the stack or the heap, depending on where they live.

To make parallel execution possible, operating systems provide multiple system threads for executing code in parallel. Each system thread gets its own program counter and stack memory, based on what it’s executing at a given time.

The problem with system threads is that creating one is slow: on most operating systems, there is significant kernel overhead involved. Switching from one thread to another also incurs a context switch, as the values of registers are saved and restored. There's a limit, then, to how many system threads we can create before most of our CPU time gets eaten up by context switching. Moreover, for most applications there's very little benefit to creating many more system threads than the CPU has cores.

Green Threads

The way that Go tackles these challenges is via an idea called green threads, which isn’t new – green threads were a part of Java early on but were removed a few releases later. The basic idea is that instead of working directly with system threads, you create your own lightweight threads and manage scheduling their execution on a small pool of system threads.

It may seem surprising that adding a layer of complexity would make a system better, but in this case we gain some significant advantages. Because we control the implementation, we can keep our threads lightweight and cheap to create. We also don’t have to accept the overhead of a system call every time we want to do something in parallel.

Go uses a system much like this, but calls its green threads goroutines. Each goroutine is started by forking off a func call using the go keyword. In code it looks like this.

// NOTE: non-example; do not use!

package main

func doItInParallel() {
  // do it
}

func main() {
  go doItInParallel()
}

The go keyword instructs the Go runtime to instantiate a new lightweight goroutine and use it to execute the provided function call. But the code we’ve written above is unlikely to work on its own, because of a property of the go keyword that we’ve yet to mention.

When we launch a function call using go, the runtime guarantees that at some point in the future the goroutine we asked for will be created and will start executing – but it makes no guarantee about when that will occur. It could happen immediately, or it could take a very long time. All we are given is that eventually the goroutine we asked for will start executing.

But our doItInParallel example has a flaw: if we try to run it, nothing happens! The reason is that main() returns immediately after our go statement, and when main() returns, the whole program exits, whether or not any other goroutines have finished. This demonstrates that statements using go are non-blocking – they don’t wait for the func to complete before continuing to the next line of code. To fix this, we need to use a WaitGroup, found in the sync package.

sync.WaitGroup

The sync package contains an array of useful tools for managing concurrency. Among those tools is the sync.WaitGroup, which allows goroutines to wait on one another. A sync.WaitGroup is essentially an integer counter (more specifically, a counting semaphore) which exposes three methods:

  • Wait() suspends execution of the current goroutine until the counter becomes zero. If the counter is already zero, Wait() does nothing.
  • Add(x int) adds its argument to the counter. Adding a negative value is allowed, but if the counter ever becomes negative, the call to Add will panic.
  • Done() is an alias for Add(-1).

From the descriptions above and our previous discussion of the “happens-before” relation, there are a few properties we need to keep in mind while using a WaitGroup. We’ll illustrate them below with a non-example.

// NOTE: non-example; do not use!

package main

import "sync"

func main() {
    var wg sync.WaitGroup    // (2)
    go func() {              // (1)
        wg.Add(1)
        doItInParallel()
        wg.Done()
    }()
    wg.Wait()                // (3)
}

Above we see an incorrect implementation using WaitGroup. First, at (1) we replace our call to doItInParallel() with a call to an anonymous func which contains reasonable calls to Add() and Done(). We declare a WaitGroup at (2) without needing to initialize anything because the zero value of a WaitGroup is ready for use – the counter starts at zero. At (3) we call wg.Wait(), blocking the main thread until the WaitGroup is zero.

By now you might be able to see the problem. Any call to wg.Wait() will only block if the counter is non-zero. If we look at the events which happen before wg.Wait(), we only see the invocation of a new goroutine at (1), and the declaration of our WaitGroup at (2). You could be forgiven for thinking that the contents of the anonymous func happen before wg.Wait(), but that’s not the case. Every line appearing within the anonymous func executes on a separate goroutine. Even though the lines come before wg.Wait(), they happen on a different timeline.

When we say every line, we are including the call to wg.Add(1) – and this is where the problem arises. wg.Wait() only waits if its counter is non-zero. The only way the counter could be non-zero is if wg.Add(1) happens before wg.Wait(), but that’s not the case here. Whether it’s happened or not depends on when the goroutine we asked the runtime to start for us is scheduled. As programmers, we have zero control over this choice.

To fix the problem, we can move wg.Add(1) outside of the anonymous func, like below, to create an explicit happens-before relationship between wg.Add(1) and wg.Wait().

package main

import "sync"

func main() {
    var wg sync.WaitGroup
    wg.Add(1)              // (4)
    go func() {
        doItInParallel()
        wg.Done()          // (5)
    }()
    wg.Wait()              // (6)
}

This code does guarantee that doItInParallel() will execute before the program ends, but it’s worth diving into the details to understand why. Since (4) happens before (6), we know that at some point in time prior to executing wg.Wait() the counter for wg was non-zero. The only way that wg’s counter can become zero again is if wg.Done() is executed – which only happens after doItInParallel() has run. We also know from the API docs that wg.Wait() only allows a thread to pass once its counter is zero. Finally, since wg.Wait() is the last line in the program, it follows that the program can only exit once doItInParallel() has completed.

The above paragraph sketches a formal proof that the program we’re looking at correctly executes doItInParallel() on a separate goroutine and allows the call to complete before terminating the program. If you followed it, well done. But in practice, we don’t wade through every detail of a proof each time we write some code. We look at the code and examine the happens-before relationships. Any time we see a call to Wait(), we ask whether every Add() on the same WaitGroup happens before it – if so, then we know the Wait() and Add() calls are placed correctly.

Note that we require a similar happens-before relationship between Add() and Done(). Remember that a WaitGroup’s counter can’t become negative without causing a panic, so if a Done() can happen before its corresponding Add() on the same sync.WaitGroup, we risk a panic.

Finally, I suspect that it may be possible to write a linter to evaluate these happens-before relationships for WaitGroups in several common circumstances and alert developers before code makes it to production. At the time of this writing, it seems that no such linter currently exists.