Metal by Tutorials

Fourth Edition · macOS 14, iOS 17 · Swift 5.9 · Xcode 15

31. Performance Optimization
Written by Marius Horga & Caroline Begbie


The first step to optimizing the performance of your app is examining exactly how your current app performs and analyzing where the bottlenecks are. The starter app provided with this chapter, even with several render passes, runs quite well as is, but you’ll study its performance so that you know where to look when you develop real-world apps.

The Starter App

➤ In Xcode, build and run the starter app for this chapter.

The starter app

There are several render passes involved:

  • ShadowRenderPass: Renders models to a depth texture.
  • ForwardRenderPass: Renders all models aside from rocks and grass.
  • NatureRenderPass: Renders rocks and grass.
  • SkyboxRenderPass: Renders the skybox.
  • Bloom: Post-processes the image with bloom.

You may find that the app runs very slowly. On an iPad mini 6, it runs at 35 to 40 FPS. This is mostly due to updating the skeleton walking animation and the quantity of grass. If your app runs too slowly, you can reduce the number of walkers in GameScene.
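To put those frame rates in context, it can help to convert them into a time budget per frame. The following is a small sketch of that arithmetic only; the function name frameTimeMilliseconds is made up for this example and is not part of the starter app.

```swift
import Foundation

/// Converts a frame rate into the time available per frame, in milliseconds.
/// (Hypothetical helper for illustration; not part of the starter project.)
func frameTimeMilliseconds(fps: Double) -> Double {
    precondition(fps > 0, "fps must be positive")
    return 1000.0 / fps
}

let targetBudget = frameTimeMilliseconds(fps: 60)  // time allowed per frame at 60 FPS
let slowFrame = frameTimeMilliseconds(fps: 35)     // time actually taken at 35 FPS
let overBudget = slowFrame - targetBudget          // how much each frame must shrink

print(String(format: "Budget: %.2f ms, measured: %.2f ms, over by %.2f ms",
             targetBudget, slowFrame, overBudget))
```

At 35 FPS each frame takes roughly 28.6 ms, so you'd need to shave off about 12 ms of CPU or GPU work per frame to reach the 16.7 ms that 60 FPS allows.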

For reasons that will become clearer later, the Uniforms structure is now held in an array of buffers, initially just an array of one. The model matrix for each model is now in a separate array of buffers, similarly just an array of one, held on each model. Switching to these persistent buffers, instead of creating a byte buffer for each draw call, improved the frame rate significantly.

Profiling

There are a few ways to monitor and tweak your app’s performance. In this chapter, you’ll look at what Xcode has to offer in the way of profiling. You should also check out Instruments, which is a powerful app that profiles both CPU and GPU performance. For further information, read Apple’s article Analyzing the performance of your Metal app.

GPU History

GPU History is a tool provided by macOS in its Activity Monitor app, so it is not inside Xcode. It shows basic GPU activity in real time for all of your GPUs. If you're using eGPUs, it shows their activity too.

GPU History

The GPU Report

➤ With your app running, in Xcode on the Debug navigator, click FPS.

The GPU report

GPU Workload Capture

In previous chapters, you captured the GPU workload to inspect textures, buffers and render passes. The GPU capture is always the first port of call for debugging. Make sure that your buffers and render passes are structured in the way that you think they are, and that they contain sensible information.

Summary

➤ With your app running, capture the GPU workload, and in the Debug navigator, click Summary.

The summary of your frame

.worldTangent = model->normalMatrix * in.tangent,
.worldBitangent = model->normalMatrix * in.bitangent,

Possible memory issues

Bandwidth issues

descriptor?.depthAttachment.storeAction = .dontCare

The Shader Profiler

The shader profiler is perhaps the most useful profiling tool for the shader code you write. It has nothing to do with the rendering code the CPU is setting up, or the passes you run or the resources you’re sending to the GPU. This tool tells you how your MSL code is performing line-by-line and how long it took to finish.

The shader profiler

Command Information

constant half3 sunlight = half3(2, 4, -4);

fragment half4 fragment_nature(
  VertexOut in [[stage_in]],
  texture2d_array<float> baseColorTexture [[texture(0)]],
  constant Params &params [[buffer(ParamsBuffer)]])
{
  constexpr sampler s(
    filter::linear,
    address::repeat,
    mip_filter::linear,
    max_anisotropy(8));
  half4 baseColor = half4(baseColorTexture.sample(s, in.uv, in.textureID));
  half3 normal = half3(normalize(in.worldNormal));

  half3 lightDirection = normalize(sunlight);
  half diffuseIntensity = saturate(dot(lightDirection, normal));
  half4 color = mix(baseColor*0.5, baseColor*1.5, diffuseIntensity);
  return color;
}

Reload Shaders

GPU Timeline

The GPU timeline tool gives you an overview of how your vertex, fragment and compute functions perform, broken down by render pass.

Capture the GPU workload

Render Passes

The GPU timeline

GPU counters

static var cullFaces = true

Face culling implemented

Memory

➤ In the Debug navigator, click the Memory tool (below Performance) to see the total memory used and how the various resources are allocated in memory:

Resources in memory

Instancing

Currently, you load fifteen skeleton walker meshes and four barrel meshes and draw them independently. Reducing the number of draw calls is one of the best ways of improving performance. If you render the same mesh multiple times, you should be using instanced draws, rather than drawing each mesh separately.
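One way to picture the saving is to group scene objects by the mesh they share, so that each unique mesh needs only one draw call with an instance count. The sketch below is plain Swift with no Metal calls; SceneObject, InstancedDraw and batchByMesh are hypothetical names for this illustration, not the chapter's types, and the positions array stands in for whatever per-instance data you'd upload to a buffer.

```swift
/// Hypothetical stand-in for a scene object; not the chapter's Model class.
struct SceneObject {
    let meshName: String
    let position: SIMD3<Float>
}

/// One instanced draw: a mesh plus the per-instance data for every copy of it.
struct InstancedDraw {
    let meshName: String
    var positions: [SIMD3<Float>]
    var instanceCount: Int { positions.count }
}

/// Groups objects that share a mesh, preserving first-seen order.
func batchByMesh(_ objects: [SceneObject]) -> [InstancedDraw] {
    var draws: [String: InstancedDraw] = [:]
    var order: [String] = []
    for object in objects {
        if draws[object.meshName] == nil {
            draws[object.meshName] = InstancedDraw(meshName: object.meshName, positions: [])
            order.append(object.meshName)
        }
        draws[object.meshName]?.positions.append(object.position)
    }
    return order.compactMap { draws[$0] }
}

// Fifteen walkers and four barrels, as in the starter scene.
let walkers = (0..<15).map { SceneObject(meshName: "walker", position: SIMD3<Float>(Float($0), 0, 0)) }
let barrels = (0..<4).map { SceneObject(meshName: "barrel", position: SIMD3<Float>(0, 0, Float($0))) }
let draws = batchByMesh(walkers + barrels)
print("\(walkers.count + barrels.count) objects -> \(draws.count) draw calls")
```

In a real renderer, each InstancedDraw would become a single instanced draw call, with the per-instance data read in the vertex function by instance index. Animated meshes such as the walkers would also need per-instance pose data, which this sketch leaves out.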

The Procedural Nature System

Using homeomorphic models, you can choose different shapes for each model. Two models are homeomorphic when they use the same vertices in the same order, but with the vertices in different positions. A famous example of this is Spot the cow by Keenan Crane.
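As a rough illustration of that idea, the following sketch stores one shared index list alongside several sets of vertex positions, and checks that every set has the same vertex count before accepting them. MorphTarget and MorphableMesh are hypothetical types invented for this example, and picking a shape with the instance number is only one possible selection rule.

```swift
/// Hypothetical morph target: positions only. The topology is shared.
struct MorphTarget {
    let positions: [SIMD3<Float>]
}

/// Several shapes that share one index list (same vertices, same order).
struct MorphableMesh {
    let indices: [UInt32]
    let targets: [MorphTarget]

    /// Fails unless there is at least one target, every target has the same
    /// vertex count, and every index refers to an existing vertex.
    init?(indices: [UInt32], targets: [MorphTarget]) {
        guard let vertexCount = targets.first?.positions.count,
              targets.allSatisfy({ $0.positions.count == vertexCount }),
              indices.allSatisfy({ Int($0) < vertexCount })
        else { return nil }
        self.indices = indices
        self.targets = targets
    }

    /// Picks the shape for one instance by cycling through the targets.
    /// Assumes a non-negative instance number.
    func positions(forInstance instance: Int) -> [SIMD3<Float>] {
        targets[instance % targets.count].positions
    }
}

// Two triangles with identical topology but a different top vertex.
let flat = MorphTarget(positions: [SIMD3<Float>(0, 0, 0), SIMD3<Float>(1, 0, 0), SIMD3<Float>(0, 1, 0)])
let tall = MorphTarget(positions: [SIMD3<Float>(0, 0, 0), SIMD3<Float>(1, 0, 0), SIMD3<Float>(0, 3, 0)])
let rock = MorphableMesh(indices: [0, 1, 2], targets: [flat, tall])!
let mismatched = MorphableMesh(indices: [0, 1, 2], targets: [flat, MorphTarget(positions: [])])
```

Because the index list never changes, a single instanced draw can render every variation; only the position data selected per instance differs.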

Spot by Keenan Crane

Homeomorphic rocks

 encoder.drawIndexedPrimitives(
   type: .triangle,
   indexCount: submesh.indexCount,
   indexType: submesh.indexType,
   indexBuffer: submesh.indexBuffer.buffer,
   indexBufferOffset: submesh.indexBuffer.offset,
   instanceCount: instanceCount)

Removing Duplicate Textures

Textures can use a lot of memory, and you should always check that you use the appropriate size for the device. Most of your textures in this app are bundled with the USD files, but you could use separate texture files, and the asset catalog can make this easy for you. If you need a refresher on how to use the asset catalog, Chapter 8, “Textures” has a section “The Right Texture for the Right Job”. However, you should also check that you aren’t duplicating textures.

The heap textures

name: NSString(string: property.textureName).lastPathComponent)
A reduced heap
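The idea behind keying textures by file name can be sketched as a small cache. This is an illustration under assumptions, not the chapter's texture controller: LoadedTexture stands in for an MTLTexture, TextureCache and texture(forPath:) are hypothetical names, and the example paths are invented. Only the NSString lastPathComponent call is taken from the snippet above.

```swift
import Foundation

/// Hypothetical stand-in for a loaded texture; a real app would hold an MTLTexture.
final class LoadedTexture {
    let name: String
    init(name: String) { self.name = name }
}

/// Caches textures by file name so the same image isn't loaded twice.
final class TextureCache {
    private var textures: [String: LoadedTexture] = [:]
    private(set) var loadCount = 0

    func texture(forPath path: String) -> LoadedTexture {
        // Use only the last path component as the key, so the same file
        // referenced through different folders maps to one entry.
        let key = NSString(string: path).lastPathComponent
        if let cached = textures[key] { return cached }
        loadCount += 1  // a real loader would read the image file here
        let texture = LoadedTexture(name: key)
        textures[key] = texture
        return texture
    }
}

let cache = TextureCache()
let a = cache.texture(forPath: "rocks/textures/rock-color.png")   // invented path
let b = cache.texture(forPath: "grass/textures/rock-color.png")   // same file name
let c = cache.texture(forPath: "grass/textures/grass-color.png")
print("Loaded \(cache.loadCount) textures for 3 requests")
```

One caveat with this approach: two genuinely different images that happen to share a file name would be treated as the same texture, so it only suits projects where file names are unique.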

CPU-GPU Synchronization

Managing dynamic data can be a little tricky. Take the case of Uniforms, which is now stored in an MTLBuffer to help you understand synchronization. Uniforms contains only the camera, shadow and projection matrices, so you usually update it once per frame on the CPU. That means that the GPU should wait until the CPU has finished writing the buffer before it can read the buffer.

Triple Buffering

Triple buffering is a well-known technique in the realm of synchronization. The idea is to use three buffers at a time. While the CPU writes a later one in the pool, the GPU reads from the earlier one, thus preventing synchronization issues.
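The rotation through the pool can be sketched without any Metal code. In the example below, plain Swift arrays stand in for the three MTLBuffers, and UniformsPool with its write method is a hypothetical type for illustration; only maxFramesInFlight and the modulo step mirror the chapter's code.

```swift
let maxFramesInFlight = 3

/// Hypothetical pool: plain arrays stand in for MTLBuffers.
struct UniformsPool {
    private(set) var buffers: [[Float]]
    private(set) var currentFrameIndex = 0

    init(count: Int) {
        buffers = Array(repeating: [0], count: count)
    }

    /// Writes this frame's value into the current slot, then advances to the
    /// next slot. Returns the slot that was written, which is the one the GPU
    /// would read for this frame.
    mutating func write(_ value: Float) -> Int {
        let slot = currentFrameIndex
        buffers[slot][0] = value
        currentFrameIndex = (currentFrameIndex + 1) % buffers.count
        return slot
    }
}

var pool = UniformsPool(count: maxFramesInFlight)
let slots = (0..<5).map { pool.write(Float($0)) }
print(slots)  // [0, 1, 2, 0, 1]
```

Frame 3 reuses slot 0, which is only safe if the GPU has finished with frame 0 by then. Rotating the index on its own doesn't guarantee that, which is why some form of synchronization is still needed.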

let maxFramesInFlight = 3

Self.currentFrameIndex =
  (Self.currentFrameIndex + 1) % maxFramesInFlight

Result of triple buffering

Resource Contention

commandBuffer.waitUntilCompleted()

Semaphores

A more performant way is to use a synchronization primitive known as a semaphore, which is a convenient way of keeping count of the available resources (your triple buffer in this case).

var semaphore: DispatchSemaphore

semaphore = DispatchSemaphore(value: maxFramesInFlight)

_ = semaphore.wait(timeout: .distantFuture)

commandBuffer.addCompletedHandler { _ in
  self.semaphore.signal()
}
commandBuffer.waitUntilCompleted()
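To see how the semaphore limits the number of frames in flight, you can simulate the pattern without a GPU. In this sketch, a concurrent dispatch queue plays the role of the GPU and its work block plays the role of the command buffer's completed handler. FrameTracker, simulateFrames and the 2 ms sleep are all assumptions made for the illustration.

```swift
import Dispatch
import Foundation

/// Counts frames in flight and remembers the highest count seen.
/// (Hypothetical helper; guarded by a lock because the fake GPU work
/// finishes on background threads.)
final class FrameTracker: @unchecked Sendable {
    private let lock = NSLock()
    private var inFlight = 0
    private(set) var peak = 0

    func begin() {
        lock.lock()
        inFlight += 1
        peak = max(peak, inFlight)
        lock.unlock()
    }

    func end() {
        lock.lock()
        inFlight -= 1
        lock.unlock()
    }
}

func simulateFrames(count: Int, maxFramesInFlight: Int) -> Int {
    let semaphore = DispatchSemaphore(value: maxFramesInFlight)
    let fakeGPU = DispatchQueue(label: "fake-gpu", attributes: .concurrent)
    let done = DispatchGroup()
    let tracker = FrameTracker()

    for _ in 0..<count {
        // CPU side: block only when every slot in the pool is still in use.
        semaphore.wait()
        tracker.begin()

        // "GPU" side: stands in for the command buffer's completed handler.
        fakeGPU.async(group: done) {
            Thread.sleep(forTimeInterval: 0.002)  // pretend to render
            tracker.end()
            semaphore.signal()                    // free one slot
        }
    }
    done.wait()  // let the remaining frames finish before returning
    return tracker.peak
}

let peak = simulateFrames(count: 12, maxFramesInFlight: 3)
print("Peak frames in flight: \(peak)")
```

The peak never exceeds maxFramesInFlight, yet the loop only stalls when all three slots are busy, rather than blocking on every frame until the work completes.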

Key Points

  • GPU History, in Activity Monitor, gives an overall picture of the performance of all the GPUs attached to your computer.
  • The GPU Report in Xcode shows you the frames per second that your app achieves. This should be 60 FPS for smooth running.
  • Capture the GPU workload for insight into what’s happening on the GPU. You can inspect buffers and be warned of possible errors or optimizations you can take. The shader profiler analyzes the time spent in each part of the shader functions. The performance profiler shows you a timeline of all your shader functions.
  • GPU counters show statistics and timings for every possible GPU function you can think of.
  • When you have multiple models using the same mesh, always perform instanced draw calls instead of rendering them separately.
  • Textures can have a huge effect on performance. Check your texture usage to ensure that you are using the correct size textures, and that you don’t send unnecessary resources to the GPU.
© 2025 Kodeco Inc.