Performance-Profiling Swift on Linux: Getting Started

Learn how to profile Server-Side Swift with perf on Linux. You’ll discover the basic principles of profiling and how to view events, call-graph-traces and perform basic analysis. By kelvin ma.

Leave a rating/review
Download materials
Save for later
Share
You are currently viewing page 2 of 4 of this article. Click here to view the first page.

Measuring Performance

Before jumping into perf itself, let’s clarify what kind of performance you’re optimizing for. Broadly speaking, there are two categories of performance:

  • Latency is the amount of time that passes between the moment you need to compute a result and the moment you receive that result.
  • Resource utilization is the amount of processor cycles a computation consumes.

The difference between the two is important because you measure them with different tools. Timers are good at measuring latency. However, timers aren’t great at measuring resource utilization because they keep ticking when a different task is running. If your application awaits a lot, timing won’t tell you much about its resource utilization.

Samplers are an alternative to timers. Samplers interrupt your application randomly and record what code is executing at that instant. Samplers are good at measuring resource utilization because they collect samples only while your application is running.

You can sample an application manually by running it many times and randomly interrupting it to see what stack trace it halts on. The perf tool automates this process by calling into a CPU component called a performance monitoring unit (PMU). A PMU can sample your application thousands of times per run.

The crypto-bot application is compute-constrained, not network-constrained. This means it has a resource utilization problem, so you’ll need to use a sampler like perf to optimize it.

Profiling With Perf

The Docker image you built already comes with perf installed. To test the perf installation, sample the sleep command:

perf record sleep 1

You should see something like this:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (8 samples) ]

Since sleep spends most of its time sleeping, perf only recorded a handful of samples from it.

Try sampling crypto-bot by running the following commands:

perf record .build/release/crypto-bot

This will result in something similar to:

FTM-PERP: 7094 update(s)
...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.210 MB perf.data (5099 samples) ]

Above the perf record footer, you’ll see that perf reprinted all the normal output from crypto-bot as it was running.

If you look closer, you’ll notice perf collected many more samples from crypto-bot than it did from sleep, and it saved them to a file called perf.data. This file is called a performance profile. The perf tool has another command — script — that you can use to load samples from a profile.

Loading Samples With Perf Script

Try running perf script without any arguments:

perf script

This command is interactive. Press space or arrow keys to navigate and Q to exit. If you’re running Docker natively on Linux you’ll see something like:

perf       16 7056.026690:    161 cycles:ppp: ffffffffa1c735bb native_write_msr ([kernel.kallsyms])
perf       16 7056.026696:   1031 cycles:ppp: ffffffffa1c735b4 native_write_msr ([kernel.kallsyms])
perf       16 7056.026702:   6650 cycles:ppp: ffffffffa1c0d345 intel_tfa_pmu_enable_all ([kernel.kallsyms])
crypto-bot 16 7056.026712:  41385 cycles:ppp: ffffffffa208f6d5 apparmor_bprm_committing_creds ([kernel.kallsyms])
crypto-bot 16 7056.026763: 155508 cycles:ppp: ffffffffa1e58909 unmap_region ([kernel.kallsyms])
crypto-bot 16 7056.027434: 233299 cycles:ppp: ffffffffa2296747 _extract_crng ([kernel.kallsyms])
...

If you’re running Docker on macOS you’ll see something like:

crypto-bot   857 338493.651057:     250000 cpu-clock:  ffff800010c7030c [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.651306:     250000 cpu-clock:      ffff838a6ce4 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.651558:     250000 cpu-clock:  ffff800010643a10 [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.651807:     250000 cpu-clock:      ffff838a6cd8 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652056:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652305:     250000 cpu-clock:      ffff838a4c6c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652556:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652806:     250000 cpu-clock:      ffff838a4c2c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653056:     250000 cpu-clock:      ffff838b3b10 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653305:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653556:     250000 cpu-clock:  ffff800010643a7c [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.653806:     250000 cpu-clock:      ffff838b3b24 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.654056:     250000 cpu-clock:      ffff838b3ab0 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.654306:     250000 cpu-clock:      ffff82ff4808 malloc_consolidate (/usr/lib64/libc-2.26.so)
crypto-bot   857 338493.654556:     250000 cpu-clock:  ffff800010c7030c [unknown] ([kernel.kallsyms])
...

This output contains a human-readable description of each sample in perf.data. By default, perf record names its output perf.data, so by default, perf script looks for a file with this name.

Note: perf script output can run wide in the terminal. Some of the following output snippets use ellipses (...) to represent elided output.

You can name the profile something else with the --output option, or -o for short. Try generating a profile named profile.perf by rerunning perf record with the following options:

perf record -o profile.perf .build/release/crypto-bot

To load it in perf script, use the --input option, or -i for short:

perf script -i profile.perf

Each line represents a single sample, so if perf record recorded 5099 samples, perf script emits 5099 lines of output. You can customize the output with the --fields option, or -F for short.

Getting Task Information

The default perf script output is complex, so try running perf script again with the -F comm option:

perf script -i profile.perf -F comm

The “comm” keyword stands for “command;” accordingly you will see perf script print the command associated with each sample:

perf 
perf 
crypto-bot 
crypto-bot 

You should see many samples associated with crypto-bot and a smaller number of samples associated with perf itself.

You can get the thread group and thread identifier for each sample by enabling the pid and tid fields. Pass multiple fields to perf script as a comma-separated list:

perf script -i profile.perf -F comm,pid,tid

You’ll see a new column of output containing two numbers. The first number is the thread group identifier; the second number is the thread identifier. The perf script tool always prints the fields in the same order, no matter what order you entered them in.

perf       542/542
perf       542/542
crypto-bot 542/542
crypto-bot 542/542

Because crypto-bot doesn’t use concurrency, every sample comes from the same thread. If you examined a performance profile from a concurrent application, you would see samples with many thread identifiers, but the same thread group identifier.

Getting Timeline Information

You can get the time stamp for each sample by enabling the time field. Try adding time to the fields list:

perf script -i profile.perf -F comm,pid,tid,time
perf       542/542 7040.081164: 
perf       542/542 7040.081168: 
crypto-bot 542/542 7040.081176: 
crypto-bot 542/542 7040.081243: 

The timestamps represent seconds elapsed since the system booted up.

Performance profiles can store many kinds of events. To get the event type for each sample, enable the event field:

perf script -i profile.perf -F comm,pid,tid,time,event

You’ll see a new column containing the event type for each sample. If you have access to hardware events, you’ll see:

perf       542/542 7040.081164: cycles:ppp: 
perf       542/542 7040.081168: cycles:ppp: 
crypto-bot 542/542 7040.081176: cycles:ppp: 
crypto-bot 542/542 7040.081243: cycles:ppp: 

All samples in profile.perf came from the cycles:ppp event. The name means “(very) precise CPU cycles”. This event plugs into a cycle counter in the PMU, and triggers a sample at every multiple of some interval.

If you’re running in an environment where the Docker container doesn’t have access to hardware events (like on macOS), you’ll see:

crypto-bot   861/861   364407.450503: cpu-clock: 
crypto-bot   861/861   364407.450641: cpu-clock: 
crypto-bot   861/861   364407.450901: cpu-clock: 
crypto-bot   861/861   364407.452287: cpu-clock: 
crypto-bot   861/861   364407.452632: cpu-clock: 

In this instance, all events came from the cpu-clock event. This event is triggered by a timer perf creates and triggers a sample at every multiple of some interval.

You can view the sampling intervals by enabling the period field:

perf script -i profile.perf -F comm,pid,tid,time,event,period

Note that perf script formats the period column before the event name. On macOS hosts, you’ll see a regular time interval:

crypto-bot   861/861   364407.450503:     250000 cpu-clock: 
crypto-bot   861/861   364407.450641:     250000 cpu-clock: 
crypto-bot   861/861   364407.450901:     250000 cpu-clock: 
crypto-bot   861/861   364407.452287:     250000 cpu-clock: 

On Linux, you’ll see a different period:

perf       542/542 7040.081164:   1012 cycles:ppp: 
perf       542/542 7040.081168:   8668 cycles:ppp: 
crypto-bot 542/542 7040.081176:  72998 cycles:ppp: 
crypto-bot 542/542 7040.081243: 358563 cycles:ppp: 

The sampling interval varies across samples because the kernel tunes it as your application runs. If you scroll down far enough, you should see the kernel eventually herds it into an equilibrium.

crypto-bot 542/542 7040.086985: 325029 cycles:ppp: 
crypto-bot 542/542 7040.087241: 322303 cycles:ppp: 
crypto-bot 542/542 7040.087495: 321438 cycles:ppp: