I often develop personal tooling for work so I don’t have to do repetitive tasks manually. It’s best to create shared tooling, but sometimes shared tooling grows out of personal tooling. I have a toolbox repository with code that automates various troubleshooting and workaround tasks that fit into my personal workflow. This particular tool probably wouldn’t be used by anyone else, and because of that I haven’t taken the time to figure out why the application hangs after running for long periods. Until now…
I’m using bbolt for local caching. It’s a key-value store written in Go, gets pulled in as a dependency, and uses a local file for persistence (vaguely similar to SQLite). Only one open handle can hold the database file at a time; even within the same process, a second Open blocks until the first handle is closed. I’m caching some database responses for 12 hours, and found that if I left the application running overnight it would deadlock on the first request (that used this specific cache) the next day.
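For a rough idea of what the cache layer does, here’s a minimal sketch of bbolt usage. This is not the real toolbox code; the file, bucket, and key names are made up.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// A single local file backs the store; only one open handle can hold
	// the file lock at a time.
	db, err := bolt.Open("cache.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a cached value inside a read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("accounts"))
		if err != nil {
			return err
		}
		return b.Put([]byte("id-123"), []byte(`{"name":"example"}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back inside a read-only transaction.
	_ = db.View(func(tx *bolt.Tx) error {
		fmt.Println(string(tx.Bucket([]byte("accounts")).Get([]byte("id-123"))))
		return nil
	})
}
```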
This is a short story about how I was able to identify and fix that bug. It’s a simple example with just enough complexity that I think it’s useful to share.
Identifying the Problem
To identify the problem I left the application running overnight, then made a request I knew would cause the application to hang. The next thing I did was attach delve (dlv) to the process.
Attach to the running process
The command dlv attach <pid> attaches the debugger to a running process by its id.
~/go/bin/dlv attach 410631
Listing Goroutines
With delve attached to a running process, the grs command lists all goroutines. While the application was hung, this was the output of running grs.
(dlv) grs
Goroutine 1 - User: /usr/local/go/src/net/fd_unix.go:172 net.(*netFD).accept (0x1e8af8) [IO wait 331702849444352]
Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [force gc (idle) 332303094788726]
Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [GC sweep wait]
Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [GC scavenge wait]
Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [finalizer wait 273482198405062]
Goroutine 19 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [chan receive]
Goroutine 49 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
Goroutine 66 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
Goroutine 67 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
Goroutine 68 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
Goroutine 69 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 273482198405062]
Goroutine 70 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 332303094788726]
Goroutine 71 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 275611426517220]
Goroutine 72 - User: /usr/local/go/src/runtime/proc.go:425 runtime.gopark (0x80e88) [debug call 332303094788726]
Goroutine 308 - User: /usr/local/go/src/runtime/time.go:300 time.Sleep (0x84ab0) [sleep]
Goroutine 310 - User: /usr/local/go/src/runtime/time.go:300 time.Sleep (0x84ab0) [sleep]
* Goroutine 311 - User: /usr/local/go/src/net/fd_posix.go:55 net.(*netFD).Read (0x1e6f28) [IO wait 331702849444352]
[17 goroutines]
Investigating Goroutines in time.Sleep
Why are these goroutines sleeping? There shouldn’t be anything using time.Sleep in the application code itself.
I used the delve command gr 310 to switch to goroutine 310. The bt command (an alias for stack) shows a stack trace for the current goroutine. This is what I saw:
(dlv) gr 310
Switched from 0 to 310 (thread 410631)
(dlv) bt
0 0x0000000000080e88 in runtime.gopark
at /usr/local/go/src/runtime/proc.go:425
1 0x0000000000084ab0 in time.Sleep
at /usr/local/go/src/runtime/time.go:300
2 0x000000000038f99c in go.etcd.io/bbolt.flock
at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/bolt_unix.go:43
3 0x0000000000394748 in go.etcd.io/bbolt.Open
at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/db.go:247
4 0x0000000000546414 in toolbox/sreweb/internal/databroker.(*Broker).openCacheDB
at ./projects/toolbox/overSREer/internal/databroker/broker.go:107
5 0x00000000005465d8 in toolbox/sreweb/internal/databroker.(*Broker).cacheLookupPrefix
at ./projects/toolbox/overSREer/internal/databroker/cache.go:21
6 0x000000000054bc7c in toolbox/sreweb/internal/databroker.(*Broker).MessageSend
at ./projects/toolbox/overSREer/internal/databroker/messagesend.go:60
7 0x0000000000558b38 in toolbox/sreweb/pkg/cyberdyne.(*CyberDyne).MessageSend
at ./projects/toolbox/overSREer/pkg/cyberdyne/messagesend.go:8
8 0x0000000000558b38 in main.GetMessageSend
at ./projects/toolbox/overSREer/cmd/web/messagesend_handlers.go:24
9 0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:2220
10 0x0000000000303914 in net/http.(*ServeMux).ServeHTTP
at /usr/local/go/src/net/http/server.go:2747
11 0x0000000000559f6c in main.(*AppState).routes.(*AppState).logRequest.func2
at ./projects/toolbox/overSREer/cmd/web/app.go:49
12 0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:2220
13 0x000000000031f4dc in net/http.serverHandler.ServeHTTP
at /usr/local/go/src/net/http/server.go:3210
14 0x000000000030085c in net/http.(*conn).serve
at /usr/local/go/src/net/http/server.go:2092
15 0x0000000000305670 in net/http.(*Server).Serve.gowrap3
at /usr/local/go/src/net/http/server.go:3360
16 0x0000000000088db4 in runtime.goexit
at /usr/local/go/src/runtime/asm_arm64.s:1223
This stack trace shows that openCacheDB is stuck sleeping. But why? Probably because another goroutine already had the database open.
The bbolt readme says:
Please note that Bolt obtains a file lock on the data file so multiple processes cannot open the same database at the same time.
Let’s look into bbolt’s bolt_unix.go:43, where we find the flock function.
// flock acquires an advisory lock on a file descriptor.
func flock(db *DB, exclusive bool, timeout time.Duration) error {
	var t time.Time
	if timeout != 0 {
		t = time.Now()
	}
	fd := db.file.Fd()
	flag := syscall.LOCK_NB
	if exclusive {
		flag |= syscall.LOCK_EX
	} else {
		flag |= syscall.LOCK_SH
	}
	for {
		// Attempt to obtain an exclusive lock.
		err := syscall.Flock(int(fd), flag)
		if err == nil {
			return nil
		} else if err != syscall.EWOULDBLOCK {
			return err
		}

		// If we timed out then return an error.
		if timeout != 0 && time.Since(t) > timeout-flockRetryTimeout {
			return ErrTimeout
		}

		// Wait for a bit and try again.
		time.Sleep(flockRetryTimeout)
	}
}
This function can be called with a timeout, but if it is not it will loop forever trying to obtain a file lock on the database. Hmm… Something definitely has the database open, but it doesn’t appear to be this goroutine; here the openCacheDB function is the one blocked waiting for the lock (the database must already be open elsewhere).
Let’s check the other goroutine that is in time.Sleep. Switching to goroutine 308 (delve command gr 308) and running bt again, we see the following…
(dlv) gr 308
Switched from 310 to 308 (thread 410631)
(dlv) bt
0 0x0000000000080e88 in runtime.gopark
at /usr/local/go/src/runtime/proc.go:425
1 0x0000000000084ab0 in time.Sleep
at /usr/local/go/src/runtime/time.go:300
2 0x000000000038f99c in go.etcd.io/bbolt.flock
at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/bolt_unix.go:43
3 0x0000000000394748 in go.etcd.io/bbolt.Open
at ./go/pkg/mod/go.etcd.io/bbolt@v1.3.10/db.go:247
4 0x0000000000546414 in toolbox/sreweb/internal/databroker.(*Broker).openCacheDB
at ./projects/toolbox/overSREer/internal/databroker/broker.go:107
5 0x0000000000544dd4 in toolbox/sreweb/internal/databroker.(*Broker).syncAccounts
at ./projects/toolbox/overSREer/internal/databroker/account.go:74
6 0x0000000000544630 in toolbox/sreweb/internal/databroker.(*Broker).Account
at ./projects/toolbox/overSREer/internal/databroker/account.go:65
7 0x000000000055120c in toolbox/sreweb/internal/databroker.Broker.accountMongoCluster
at ./projects/toolbox/overSREer/internal/databroker/util.go:6
8 0x000000000054c214 in toolbox/sreweb/internal/databroker.(*Broker).MessageSend
at ./projects/toolbox/overSREer/internal/databroker/messagesend.go:93
9 0x0000000000558b38 in toolbox/sreweb/pkg/cyberdyne.(*CyberDyne).MessageSend
at ./projects/toolbox/overSREer/pkg/cyberdyne/messagesend.go:8
10 0x0000000000558b38 in main.GetMessageSend
at ./projects/toolbox/overSREer/cmd/web/messagesend_handlers.go:24
11 0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:2220
12 0x0000000000303914 in net/http.(*ServeMux).ServeHTTP
at /usr/local/go/src/net/http/server.go:2747
13 0x0000000000559f6c in main.(*AppState).routes.(*AppState).logRequest.func2
at ./projects/toolbox/overSREer/cmd/web/app.go:49
14 0x0000000000301ca8 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:2220
15 0x000000000031f4dc in net/http.serverHandler.ServeHTTP
at /usr/local/go/src/net/http/server.go:3210
16 0x000000000030085c in net/http.(*conn).serve
at /usr/local/go/src/net/http/server.go:2092
17 0x0000000000305670 in net/http.(*Server).Serve.gowrap3
at /usr/local/go/src/net/http/server.go:3360
18 0x0000000000088db4 in runtime.goexit
at /usr/local/go/src/runtime/asm_arm64.s:1223
(dlv)
This goroutine is deeper in the application and is likely what’s causing the deadlock. It has called Account(), which then called syncAccounts(), which calls openCacheDB and then goes to sleep forever.
Here’s the Account() function and a snippet of syncAccounts(). Can you spot the problem?
func (b *Broker) Account(id string) (model.Account, error) {
	var a model.Account
	now := time.Now().UTC()
	if err := b.openCacheDB(); err != nil {
		b.log.Error("b.openCacheDB", slog.String("error", err.Error()))
		return model.Account{}, err
	}
	defer b.db.Close()
	// check cache expiration; return account if not expired
	err := b.db.View(func(tx *bolt.Tx) error {
		bucket := tx.Bucket([]byte(CacheAccountBucket))
		if bucket == nil {
			return fmt.Errorf("bucket [%s] does not exist", CacheAccountBucket)
		}
		t := bucket.Get([]byte("expires"))
		if t != nil {
			expires, err := time.Parse(CacheTimeLayout, string(t))
			if err != nil {
				b.log.Error("unable to parse time", slog.String("expires", string(t)))
				return err
			}
			if expires.Before(now) {
				return errors.New("accounts cache expired")
			}
			account := bucket.Get([]byte(id))
			decryptedValue, err := b.decrypt(account)
			if err != nil {
				return err
			}
			if err := json.Unmarshal(decryptedValue, &a); err != nil {
				return err
			}
		} else {
			return errors.New("no key 'expires'")
		}
		return nil
	})
	if err == nil {
		return a, nil
	}
	// account doesn't exist, cache is expired, or some other err
	if err.Error() != "accounts cache expired" {
		return model.Account{}, err
	} else {
		// cache is expired so syncAccounts and lookup the account again
		if err := b.syncAccounts(); err != nil {
			return model.Account{}, err
		}
		return b.Account(id)
	}
}

func (b *Broker) syncAccounts() error {
	if err := b.openCacheDB(); err != nil {
		b.log.Error("b.openCacheDB", slog.String("error", err.Error()))
		return err
	}
	defer b.db.Close()
	...
}
The Account() function calls openCacheDB() and uses defer b.db.Close() as one does when wanting to ensure they clean up after themselves. This is normally fine, but…
The cache expires after 12 hours, and the Account() function calls syncAccounts() to refresh the accounts cache if it is expired. This seems like a nice idea, but the call to syncAccounts() also attempts to open the database. Unfortunately for syncAccounts(), the deferred b.db.Close() call in Account() isn’t called until Account() returns.
And there we sit, deadlocked.
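To make the failure mode concrete, here’s a minimal, self-contained sketch of the same pattern. This is not the toolbox code, and the second Open passes a short timeout only so the demo terminates instead of looping in flock forever.

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// refresh plays the role of syncAccounts: it tries to open the same file
// while the caller still holds it open.
func refresh(path string) error {
	db, err := bolt.Open(path, 0600, &bolt.Options{Timeout: 1 * time.Second})
	if err != nil {
		return err // "timeout": the file lock is still held by the caller
	}
	defer db.Close()
	return nil
}

func main() {
	const path = "cache.db"

	// Plays the role of Account(): open the database and defer closing it.
	db, err := bolt.Open(path, 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close() // not released until main returns

	// Without the Timeout option this call would never return.
	if err := refresh(path); err != nil {
		fmt.Println("refresh:", err)
	}
}
```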
Fixing The Bug
To fix this bug I made two changes.
Set a Timeout on bolt.Open()
bbolt supports passing a timeout to the Open() function. Using a timeout can prevent deadlocks; it wouldn’t truly “fix the bug”, but it would prevent the application from hanging forever and instead return an ErrTimeout. From the readme:
Please note that Bolt obtains a file lock on the data file so multiple processes cannot open the same database at the same time. Opening an already open Bolt database will cause it to hang until the other process closes it. To prevent an indefinite wait you can pass a timeout option to the Open() function: db, err = bolt.Open(p, 0600, &bolt.Options{Timeout: 1 * time.Second})
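My openCacheDB() function isn’t shown in this post, but applying the option looks roughly like this. This is a sketch only; b.db and b.cachePath are assumed names for the Broker’s fields.

```go
// openCacheDB opens the bbolt cache file with a one-second lock timeout so a
// second open attempt fails fast instead of spinning in flock forever.
// Sketch: b.db and b.cachePath are assumed field names on Broker.
// (imports: time, bolt "go.etcd.io/bbolt")
func (b *Broker) openCacheDB() error {
	db, err := bolt.Open(b.cachePath, 0600, &bolt.Options{Timeout: 1 * time.Second})
	if err != nil {
		return err // bolt.ErrTimeout if another handle still holds the file lock
	}
	b.db = db
	return nil
}
```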
Remove openCacheDB() From syncAccounts()
To fix the actual bug of not being able to sync the accounts cache from within the Account() function, I removed the openCacheDB() call from syncAccounts().
The syncAccounts() function is only used in two places in the application: at start-up, and when the accounts cache is expired. Removing the openCacheDB() call from syncAccounts(), and only calling syncAccounts() from functions that have already opened the database (and deferred closing it), allows the accounts cache to update as intended.
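A sketch of what the refactor looks like. The start-up call site and its name are hypothetical, and the elided sync logic is unchanged from the original function.

```go
// syncAccounts no longer opens the database itself. It assumes the caller
// (Account(), or the start-up code) has already called openCacheDB() and
// deferred b.db.Close().
func (b *Broker) syncAccounts() error {
	// ... fetch fresh accounts and write them, plus a new "expires" key,
	// to the already-open b.db ...
	return nil
}

// Hypothetical start-up call site: open the database once, defer closing it,
// then sync.
func (b *Broker) warmAccountsCache() error {
	if err := b.openCacheDB(); err != nil {
		return err
	}
	defer b.db.Close()
	return b.syncAccounts()
}
```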
Conclusion
This was a pretty straightforward bug that I probably should have caught sooner. My only excuse is that it’s a locally running application that doesn’t get a lot of use, and restarting it made everything work correctly again (for up to 12 hours).
Having the time and motivation to look into it seemed like a nice opportunity to break out the debugger and write a blog post. Hopefully this is useful to someone who isn’t familiar with how a debugger can help, uh, debug applications.
Recapping the issue and what was done to resolve it:
- The application would hang after running “overnight”, the next time it needed to access the accounts cache.
- With the application in a hung state we attached the delve debugger to the process using its process id.
- Reviewing the running goroutines revealed that two were stuck in time.Sleep.
- Goroutine 310 was sleeping, waiting for openCacheDB() to return.
- Goroutine 308 was further along. On this goroutine we saw a call to Account() and then syncAccounts().
- Examining these functions more closely revealed that Account() was closing the db using defer. This meant syncAccounts() wasn’t able to get a lock on the database and the goroutine deadlocked.
- Had goroutine 308 ever returned, 310 would have moved further along and eventually deadlocked in the same place as 308.
- Refactoring the code to ensure the database is only opened once prevents the deadlock.
Feedback is welcome; connect with me on Mastodon and Bluesky.