Performance-Profiling Swift on Linux: Getting Started

Learn how to profile Server-Side Swift with perf on Linux. You’ll discover the basic principles of profiling and how to view events, call-graph-traces and perform basic analysis. By kelvin ma.

Leave a rating/review
Download materials
Save for later
Share
You are currently viewing page 3 of 4 of this article. Click here to view the first page.

Getting Landmark Information

Sampling works by taking snapshots of the CPU’s instruction pointer (IP). You can view the instruction pointers in the ip field:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip
perf       542/542 7040.081164:   1012 cycles:ppp: ffffffff878735b4
perf       542/542 7040.081168:   8668 cycles:ppp: ffffffff884016f2
crypto-bot 542/542 7040.081176:  72998 cycles:ppp: ffffffff87ad38f0
crypto-bot 542/542 7040.081243: 358563 cycles:ppp: ffffffff88277427

By itself, this field isn’t too useful. To get the string names of the symbols associated with the instruction pointers, enable the sym field. To get their distances from the instruction pointers, enable the symoff field, which stands for symbol offsets. Add sym and symoff to the list of fields:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff
...   1012 cycles:ppp: ffffffff878735b4 native_write_msr+0x4
...   8668 cycles:ppp: ffffffff884016f2 nmi_restore+0x25
...  72998 cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0
... 358563 cycles:ppp: ffffffff88277427 clear_page_erms+0x7

You can also view the binaries that the instructions originated from in the dso field. The name stands for dynamic shared object. Try adding it to the list of fields:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

You’ll see the file path to the binary in parentheses after the symbol name and offset, or [kernel.kallsyms] if the symbol lives in the kernel.

... cycles:ppp: ffffffff878735b4 native_write_msr+0x4 ([kernel.kallsyms])
... cycles:ppp: ffffffff884016f2 nmi_restore+0x25 ([kernel.kallsyms])
... cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0 ([kernel.kallsyms])
... cycles:ppp: ffffffff88277427 clear_page_erms+0x7 ([kernel.kallsyms])

Scroll down the list of samples and look for binaries you recognize. You should easily find:

  • /usr/lib/swift/linux/libswiftCore.so: This is the Swift standard library.
  • /crypto-bot/.build//release/crypto-bot: This is the crypto-bot executable itself.

Working With Perf Events

It’s possible to capture events besides the CPU cycles. Although it’s only a subset of the events most PMUs support, perf record can also capture:

  • instructions: Similar to cycles, except it samples on a counter that tracks instructions instead of clock cycles. The two can give varying results because some instructions take multiple cycles to complete.
  • branches: Samples on a counter that tracks branch instructions. Branch instructions are critical because they are very expensive if incorrectly predicted.
  • branch-misses: Samples on a counter that tracks branch predictor misses.
  • cache-misses: Samples on a counter that tracks CPU cache misses.

In a normal optimization workflow, you would use the general cycles event to determine which parts of your code are slow, and then you’d use the more specialized events to figure out why those areas are slow.

You can get the full list of events your PMU supports with the perf list command:

perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  ...

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  ...

Your output might be different. Besides the name of the event, perf list tells you what kind of event (hardware, software, etc.) it is.

To record an event besides cycles, pass its name to perf record’s --event option, or its shortened form -e. To record multiple events at once, pass their names as a comma-separated list. For example, to sample branch misses and cache misses, pass the following options:

perf record -o branch-and-cache-misses.perf -e branch-misses,cache-misses .build/release/crypto-bot
Note: This will only work on Linux hosts. If you’re using macOS continue to the next section.

Try viewing the output in perf script:

perf script -i branch-and-cache-misses.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

You will see events of types branch-misses and cache-misses in place of cycles:ppp.

...  1 branch-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
...  1  cache-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
... 31 branch-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...
... 84  cache-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...

Collecting Call Graph Traces

So far, you’ve learned how to reconstruct a timeline of crypto-bot’s execution and read off some basic information about which function frame was at the top of the stack at each point in time. To get a better idea of what’s happening in an application you don’t only want to know what it’s executing at a moment, but also what sequence of calls it took to get there. This information is called a call graph trace. It’s like a stack dump, except there’s one for every sample in the recording.

Tracing With Frame Pointers

One way to capture a call graph trace is to follow the frame pointer (FP). In Swift and many other compiled languages, frame pointers form a linked list so debuggers and profilers can follow to reconstruct a call graph. Rerun perf record again, this time with the --call-graph fp option:

perf record -o profile.perf --call-graph fp .build/release/crypto-bot

You’ll see perf write more than twice as much data to disk than it did last time.

...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.630 MB profile.perf (5121 samples) ]

Call graph traces take up a lot of space, which is why perf doesn’t record them by default.

Warning: Frame pointer call graphs require a frame pointer! If you compile a binary with -fomit-frame-pointer, you can’t generate call graph traces for it.

Viewing Call Graph Traces

Try viewing the regenerated profile in perf script:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

In place of the ip, sym, symoff and dso fields, perf script now produces multiple rows of these fields for each sample in the profile. Because each sample now spans multiple lines, perf script now separates samples with empty newlines.

perf 14/14 8358.627472: 1 cycles:ppp: 
    7fff8e810345 intel_tfa_pmu_enable_all+0x800071803035 ([kernel.kallsyms])
    7fff8e80a598 x86_pmu_enable+0x800071803118 ([kernel.kallsyms])
    ...
    7fff8e806fd7 do_syscall_64+0x800071803057 ([kernel.kallsyms])
    7fff8f40308c entry_SYSCALL_64_after_hwframe+0x800071803044 ([kernel.k...

The first row is the same as the ip, sym, symoff and dso fields in the profile without call graph information. The subsequent rows list the function frames that were below it when perf recorded the sample.

The perf record tool supports other tracing methods besides frame pointer tracing, including:

  • dwarf: Uses compiler-generated metadata in the binaries in place of frame pointers. This generates more detailed information, but also increases profiling overhead.
  • lbr: Uses the last branch table in the CPU in place of frame pointers. Only Intel CPUs support this tracing method.

Working With Call Graph Traces

Call graph traces contain a lot of information, not all of which is relevant to your optimization problem.

Often, the startup phase of your application is different from its running phase. You can exclude the startup phase from the recordings by setting a delay with perf record’s --delay option, or -D for short. You express the delay in milliseconds. For example, to make perf record wait 500 milliseconds before recording:

perf record --delay=500 --call-graph fp -o profile-delayed.perf .build/release/crypto-bot

You should see perf record fewer samples to disk than it did without the delay.

...
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.336 MB profile-delayed.perf (2827 samples) ]

For more flexible time-slicing, you can attach a recorder to an already-running process with the --pid option:

.build/release/crypto-bot & perf record --call-graph fp -o profile-attached.perf --pid $(pidof crypto-bot)
Note: On powerful machines, like Apple Silicon Macs, the binary may finish running before perf is started! If this happens you’ll see an error because pidof crypto-bot doesn’t return anything.

You can also limit the number of samples perf records overall with the --freq option, or -F for short. The name stands for “sampling frequency.” On average, perf will attempt to collect the number of samples you specify every second. Try setting it to 100 samples per second:

perf record --freq 100 --call-graph fp -o profile-lite.perf .build/release/crypto-bot

You will see perf recorded only a fraction of the samples it did before.

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.027 MB profile-fixed.perf (114 samples) ]