Debugging a Go Deadlock With Delve

Written by Luke Arntz on April 22, 2025

I often develop personal tooling for work so I don’t have to do repetitive tasks manually. It’s best to create shared tooling, but sometimes shared tooling grows out of personal tooling. I have a toolbox repository with code that automates various troubleshooting and workaround tasks that fit into my personal workflow. This particular tool probably wouldn’t be used by anyone else, and because of that I haven’t taken the time to figure out why the application hangs after running for long periods. Until now…

I’m using bbolt for local caching. It’s a key-value store written in Go, gets pulled in as a dependency, and uses a local file for persistence (vaguely similar to SQLite). Only one process can have the database file open at a time, and it can’t be opened a second time within the same process while it’s still open. I’m caching some database responses for 12 hours, and found that if I left the application overnight it would deadlock on the first request (that used this specific cache) the next day.

This is a short story about how I was able to identify and fix that bug. It’s a simple example with just enough complexity that I think it’s useful to share.

Identifying the Problem

To identify the problem I left the application running overnight then made a request I knew would cause the application to hang.

The next thing I did was run delve (dlv) attached to the process.

Attach to the Running Process

The command dlv attach <pid> attaches the debugger to a running process by its process id.

~/go/bin/dlv attach 410631

Listing Goroutines

With delve attached to the running process, the grs command (an alias for goroutines) lists all goroutines. While the application was hung, this was the output of grs:

(dlv) grs
  Goroutine 1 - User: /usr/local/go/src/net/fd_unix.go:172 net.(*netFD).accept (0x1e8af8) [IO wait 331702849444352]
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [force gc (idle) 332303094788726]
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [GC sweep wait]
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [GC scavenge wait]
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [finalizer wait 273482198405062]
  Goroutine 19 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [chan receive]
  Goroutine 49 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
  Goroutine 66 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
  Goroutine 67 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
  Goroutine 68 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
  Goroutine 69 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
  Goroutine 70 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 332303094788726]
  Goroutine 71 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 275611426517220]
  Goroutine 72 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 332303094788726]
  Goroutine 308 - User: /usr/local/go/src/runtime/time.go:300 time.Sleep (0x84ab0) [sleep]
  Goroutine 310 - User: /usr/local/go/src/runtime/time.go:300 time.Sleep (0x84ab0) [sleep]
* Goroutine 311 - User: /usr/local/go/src/net/fd_posix.go:55 net.(*netFD).Read (0x1e6f28) [IO wait 331702849444352]
[17 goroutines]

Investigate Goroutines in time.Sleep

Why are goroutines sleeping? There shouldn’t be any calls to time.Sleep in the application code itself.

I used delve’s gr 310 command (an alias for goroutine) to switch to goroutine 310. Running bt (an alias for stack) then shows a stack trace for this goroutine. This is what I saw:

(dlv) gr 310
Switched from 0 to 310 (thread 410631)

(dlv) bt
 0  0x0000000000080e88 in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:425
 1  0x0000000000084ab0 in time.Sleep
    at /usr/local/go/src/runtime/time.go:300
 2  0x000000000038f99c in go.etcd.io/bbolt.flock
    at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/bolt_unix.go:43
 3  0x0000000000394748 in go.etcd.io/bbolt.Open
    at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/db.go:247
 4  0x0000000000546414 in toolbox/sreweb/internal/databroker.(*Broker).openCacheDB
    at ./projects/toolbox/overSREer/internal/databroker/broker.go:107
 5  0x00000000005465d8 in toolbox/sreweb/internal/databroker.(*Broker).cacheLookupPrefix
    at ./projects/toolbox/overSREer/internal/databroker/cache.go:21
 6  0x000000000054bc7c in toolbox/sreweb/internal/databroker.(*Broker).MessageSend
    at ./projects/toolbox/overSREer/internal/databroker/messagesend.go:60
 7  0x0000000000558b38 in toolbox/sreweb/pkg/cyberdyne.(*CyberDyne).MessageSend
    at ./projects/toolbox/overSREer/pkg/cyberdyne/messagesend.go:8
 8  0x0000000000558b38 in main.GetMessageSend
    at ./projects/toolbox/overSREer/cmd/web/messagesend_handlers.go:24
 9  0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2220
10  0x0000000000303914 in net/http.(*ServeMux).ServeHTTP
    at /usr/local/go/src/net/http/server.go:2747
11  0x0000000000559f6c in main.(*AppState).routes.(*AppState).logRequest.func2
    at ./projects/toolbox/overSREer/cmd/web/app.go:49
12  0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2220
13  0x000000000031f4dc in net/http.serverHandler.ServeHTTP
    at /usr/local/go/src/net/http/server.go:3210
14  0x000000000030085c in net/http.(*conn).serve
    at /usr/local/go/src/net/http/server.go:2092
15  0x0000000000305670 in net/http.(*Server).Serve.gowrap3
    at /usr/local/go/src/net/http/server.go:3360
16  0x0000000000088db4 in runtime.goexit
    at /usr/local/go/src/runtime/asm_arm64.s:1223

This stack shows that openCacheDB is stuck sleeping inside bbolt’s Open. But why? Probably because another goroutine already had the database open.

The bbolt readme says:

Please note that Bolt obtains a file lock on the data file so multiple processes cannot open the same database at the same time.

Let’s look at bbolt’s bolt_unix.go:43, where we find the flock function.

// flock acquires an advisory lock on a file descriptor.
func flock(db *DB, exclusive bool, timeout time.Duration) error {
	var t time.Time
	if timeout != 0 {
		t = time.Now()
	}
	fd := db.file.Fd()
	flag := syscall.LOCK_NB
	if exclusive {
		flag |= syscall.LOCK_EX
	} else {
		flag |= syscall.LOCK_SH
	}
	for {
		// Attempt to obtain an exclusive lock.
		err := syscall.Flock(int(fd), flag)
		if err == nil {
			return nil
		} else if err != syscall.EWOULDBLOCK {
			return err
		}

		// If we timed out then return an error.
		if timeout != 0 && time.Since(t) > timeout-flockRetryTimeout {
			return ErrTimeout
		}

		// Wait for a bit and try again.
		time.Sleep(flockRetryTimeout)
	}
}

This function can be called with a timeout, but without one it will loop forever trying to obtain a file lock on the database. Hmm… Something definitely has the database open, but it isn’t this goroutine: here the openCacheDB call is the one blocked waiting for the lock (the database must already be open elsewhere).

Let’s check the other goroutine that is in time.Sleep. Switching to goroutine 308 (with the delve command gr 308) and running bt again, we see the following…

(dlv) gr 308
Switched from 310 to 308 (thread 410631)

(dlv) bt
 0  0x0000000000080e88 in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:425
 1  0x0000000000084ab0 in time.Sleep
    at /usr/local/go/src/runtime/time.go:300
 2  0x000000000038f99c in go.etcd.io/bbolt.flock
    at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/bolt_unix.go:43
 3  0x0000000000394748 in go.etcd.io/bbolt.Open
    at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/db.go:247
 4  0x0000000000546414 in toolbox/sreweb/internal/databroker.(*Broker).openCacheDB
    at ./projects/toolbox/overSREer/internal/databroker/broker.go:107
 5  0x0000000000544dd4 in toolbox/sreweb/internal/databroker.(*Broker).syncAccounts
    at ./projects/toolbox/overSREer/internal/databroker/account.go:74
 6  0x0000000000544630 in toolbox/sreweb/internal/databroker.(*Broker).Account
    at ./projects/toolbox/overSREer/internal/databroker/account.go:65
 7  0x000000000055120c in toolbox/sreweb/internal/databroker.Broker.accountMongoCluster
    at ./projects/toolbox/overSREer/internal/databroker/util.go:6
 8  0x000000000054c214 in toolbox/sreweb/internal/databroker.(*Broker).MessageSend
    at ./projects/toolbox/overSREer/internal/databroker/messagesend.go:93
 9  0x0000000000558b38 in toolbox/sreweb/pkg/cyberdyne.(*CyberDyne).MessageSend
    at ./projects/toolbox/overSREer/pkg/cyberdyne/messagesend.go:8
10  0x0000000000558b38 in main.GetMessageSend
    at ./projects/toolbox/overSREer/cmd/web/messagesend_handlers.go:24
11  0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2220
12  0x0000000000303914 in net/http.(*ServeMux).ServeHTTP
    at /usr/local/go/src/net/http/server.go:2747
13  0x0000000000559f6c in main.(*AppState).routes.(*AppState).logRequest.func2
    at ./projects/toolbox/overSREer/cmd/web/app.go:49
14  0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2220
15  0x000000000031f4dc in net/http.serverHandler.ServeHTTP
    at /usr/local/go/src/net/http/server.go:3210
16  0x000000000030085c in net/http.(*conn).serve
    at /usr/local/go/src/net/http/server.go:2092
17  0x0000000000305670 in net/http.(*Server).Serve.gowrap3
    at /usr/local/go/src/net/http/server.go:3360
18  0x0000000000088db4 in runtime.goexit
    at /usr/local/go/src/runtime/asm_arm64.s:1223
(dlv)

This goroutine is deeper in the application and is likely what’s causing the deadlock. It called Account(), which called syncAccounts(), which calls openCacheDB() and then sleeps forever.

Here’s the Account() function and a snippet of syncAccounts(). Can you spot the problem?

func (b *Broker) Account(id string) (model.Account, error) {
	var a model.Account
	now := time.Now().UTC()

	if err := b.openCacheDB(); err != nil {
		b.log.Error("b.openCacheDB", slog.String("error", err.Error()))
		return model.Account{}, err
	}
	defer b.db.Close()

	// check cache expiration; return account if not expired
	err := b.db.View(func(tx *bolt.Tx) error {
		bucket := tx.Bucket([]byte(CacheAccountBucket))
		if bucket == nil {
			return fmt.Errorf("bucket [%s] does not exist", CacheAccountBucket)
		}

		t := bucket.Get([]byte("expires"))
		if t != nil {
			expires, err := time.Parse(CacheTimeLayout, string(t))
			if err != nil {
				b.log.Error("unable to parse time", slog.String("expires", string(t)))
				return err
			}
			if expires.Before(now) {
				return errors.New("accounts cache expired")
			}
			account := bucket.Get([]byte(id))
			decryptedValue, err := b.decrypt(account)
			if err != nil {
				return err
			}
			if err := json.Unmarshal(decryptedValue, &a); err != nil {
				return err
			}
		} else {
			return errors.New("no key 'expires'")
		}
		return nil
	})
	if err == nil {
		return a, nil
	}

	// account doesn't exist, cache is expired, or some other err
	if err.Error() != "accounts cache expired" {
		return model.Account{}, err
	} else {
		// cache is expired so syncAccounts and lookup the account again
		if err := b.syncAccounts(); err != nil {
			return model.Account{}, err
		}
		return b.Account(id)
	}
}

func (b *Broker) syncAccounts() error {
	if err := b.openCacheDB(); err != nil {
		b.log.Error("b.openCacheDB", slog.String("error", err.Error()))
		return err
	}
	defer b.db.Close()

	...
}

The Account() function calls openCacheDB() and uses defer b.db.Close(), as one does to make sure the database gets closed on every return path.

This is normally fine, but…

The cache expires after 12 hours, and the Account() function calls syncAccounts() to refresh the accounts cache if it is expired. This seems like a nice idea, but the call to syncAccounts() also attempts to open the database. Unfortunately for syncAccounts() the deferred b.db.Close() call in Account() isn’t called until Account() returns.

And there we sit, deadlocked.
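
To make the failure mode concrete, here’s a minimal, self-contained sketch (not the toolbox code) of the same pattern: a second bolt.Open() on a file the process already holds open, with no timeout set, never returns.

package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// The first Open takes an exclusive flock on cache.db and keeps it
	// until Close() runs -- here deferred to the end of main.
	db, err := bolt.Open("cache.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	fmt.Println("first open succeeded; opening again...")

	// A second Open on the same file, with no Timeout option, spins in
	// flock's retry loop forever -- the same shape as Account() calling
	// syncAccounts() before its deferred Close has run.
	db2, err := bolt.Open("cache.db", 0600, nil) // never returns
	if err != nil {
		panic(err)
	}
	defer db2.Close()
}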

Fixing The Bug

To fix this bug I made two changes.

Set a Timeout on bolt.Open()

bbolt supports passing a timeout to the Open() function. This doesn’t truly “fix the bug”, but it prevents the application from hanging forever: instead of sleeping in the retry loop, Open() returns ErrTimeout.

Please note that Bolt obtains a file lock on the data file so multiple processes cannot open the same database at the same time. Opening an already open Bolt database will cause it to hang until the other process closes it. To prevent an indefinite wait you can pass a timeout option to the Open() function: db, err = bolt.Open(p, 0600, &bolt.Options{Timeout: 1 * time.Second})
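
The post doesn’t show openCacheDB() itself, so here’s a rough sketch of what the timeout change could look like. The Broker fields (db, cachePath) are assumptions; the Options{Timeout: ...} usage is straight from the readme.

package databroker

import (
	"time"

	bolt "go.etcd.io/bbolt"
)

// Broker is abridged; the db and cachePath field names are assumptions.
type Broker struct {
	db        *bolt.DB
	cachePath string
}

// openCacheDB opens the cache file with a one-second lock timeout. If the
// file is already locked, Open returns bolt.ErrTimeout instead of sleeping
// forever in flock's retry loop.
func (b *Broker) openCacheDB() error {
	db, err := bolt.Open(b.cachePath, 0600, &bolt.Options{Timeout: 1 * time.Second})
	if err != nil {
		return err
	}
	b.db = db
	return nil
}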

Remove openCacheDB() From syncAccounts()

To fix the actual bug of not being able to sync the accounts cache from within the Account() function, I removed the openCacheDB() call from syncAccounts().

The syncAccounts() function is only used in two places in the application: at start-up and when the accounts cache has expired. Removing the openCacheDB() call from syncAccounts(), and calling syncAccounts() only from functions that have already opened the database (and deferred closing it), allows the accounts cache to update as intended.
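
In the same abridged form as the snippet above, the refactored syncAccounts() might look something like this (the sync body stays elided; the key change is that it no longer opens or closes the database itself):

func (b *Broker) syncAccounts() error {
	// No openCacheDB() and no deferred Close here anymore: the caller
	// (Account(), or the start-up path) has already opened b.db and
	// deferred b.db.Close().

	...
}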

Conclusion

This was a pretty straightforward bug that I probably should have caught previously. My only excuse is that it’s a locally running application that doesn’t get a lot of use, and restarting the application allowed it to work correctly (for up to 12 hours).

Having the time and motivation to look into it seemed like a nice opportunity to break out the debugger and write a blog post. Hopefully, this can be useful to someone that isn’t familiar with how using a debugger can help, uh, debug applications.

Recapping the issue and what was done to resolve it:

  • The application would hang after running “overnight” and then needing to access the accounts cache.
  • With the application in a hung state we attached the delve debugger to the process using the process id.
  • Reviewing the running goroutines revealed that two were stuck in time.Sleep.
  • Goroutine 310 was sleeping inside openCacheDB(), waiting for the file lock.
  • Goroutine 308 was further along. On this goroutine we saw a call to Account() and then syncAccounts().
  • Examining these functions more closely revealed that Account() was closing the db using defer. This meant syncAccounts() wasn’t able to get a lock on the database and the goroutine deadlocked.
  • Had goroutine 308 ever returned, 310 would have moved further along and eventually deadlocked in the same place as 308.
  • Refactoring the code to ensure the database is only opened once prevents the deadlock.

Feedback is welcome; connect with me on Mastodon and Bluesky.
