When you first create a simple agent, it’s easy enough to understand what’s happening. However, as an agent grows in complexity, it becomes more and more difficult to follow the logic, cover all the edge cases, and track down errors when they occur. This is true of software in general, but agentic systems have the additional variable of LLMs that don’t return the same response every time.
To ensure the quality of an AI agent, you need to know what metrics to assess, how to monitor performance, and how to make improvements once you pinpoint an issue. First, you’ll look at what to measure.
Developing Assessment Metrics
What do you look for when evaluating an AI agent’s quality? Some areas to consider are accuracy, user satisfaction, and efficiency. Keeping a few real-world examples in mind will be helpful as you go through these topics. Remember the localizer app you’ve been building throughout the module. Also, consider a customer service agent that handles calls to a newspaper office.
Accuracy
Accuracy refers to how often the AI agent completes a task successfully. This can be thought of in terms of the agent’s success or error rate.
Take your application string localizer agent as an example. If you ran a hundred strings that you wanted the agent to translate, and the agent translated 95 of them correctly and five incorrectly, then the success rate would be 95%, while the error rate would be 5%. Is that acceptable? That may depend on what you’re counting as an error. A “correct” translation is somewhat subjective. If a phrase is correct in meaning but sounds a bit awkward, do you count that as an error? That’s something you’ll need to think about.
How about the customer service AI agent at the newspaper office? What do you count as a success? What’s an error? A success would probably be when the customer accomplishes what they called about: They got a question answered. They put their newspaper on hold while they’re on vacation. They canceled their subscription. Even if the AI agent can’t handle a task, you might still count it as a success if the agent successfully handed the customer over to a human.
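As a rough illustration of how the definition of an error changes the reported rate, here is a minimal Python sketch. The labels (`correct`, `awkward`, `incorrect`), the counts, and the `success_rate` helper are all hypothetical and chosen only for illustration; they are not part of the localizer app.

```python
from collections import Counter


def success_rate(outcomes, counts_as_success=("correct",)):
    """Return the fraction of outcomes whose label counts as a success."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    tally = Counter(outcomes)
    successes = sum(tally[label] for label in counts_as_success)
    return successes / len(outcomes)


# Hypothetical labels from a manual review of 100 translated strings.
outcomes = ["correct"] * 90 + ["awkward"] * 5 + ["incorrect"] * 5

# Strict: only "correct" counts. Lenient: "awkward" is still a success.
strict = success_rate(outcomes)
lenient = success_rate(outcomes, counts_as_success=("correct", "awkward"))

print(f"strict: {strict:.0%}, lenient: {lenient:.0%}")
```

The same set of reviewed strings yields two different success rates, so it helps to write down which labels count as a success before comparing runs.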
User Satisfaction
User satisfaction is closely related to accuracy. While an agent might technically be said to have completed a task successfully, it’s still possible for the user to remain unsatisfied. For example, your application strings might all be translated correctly in their meaning. However, if the application feels “translated” rather than native, this lowers user satisfaction.
Going back to the newspaper customer service agent, a customer might “successfully” get a discount on their subscription, but if they had to repeat their request 10 times, that hardly makes for a satisfied customer. You could say that a satisfied user is the gold standard for evaluating the success of an AI agent.
Nowadays, hardly anybody wants to talk to a machine. The experience is too painful. It’s much more pleasant and effective to talk to a human. Can you imagine a world, though, where the agent is so knowledgeable, so natural sounding, and so effective that people universally prefer talking to an AI agent over a human? Can you build that kind of agent? The technology to do so is largely already here. Architecting and building that system is your job.
Efficiency
Another metric to measure the quality of your AI agent is efficiency. Time and resources are both issues here.
Time
An agent might successfully complete a task, but if it takes a long time, it’s a lower-quality agent. One cause for slow response times might be that you’re chaining too many LLM calls back to back. Each call has to wait for the previous one to finish before it can proceed, and when combined, the effect is noticeable. Another cause for a slow response might be server overload.
All that depends on your application. If it’s a string localizer, you probably don’t care if the response takes a few extra seconds. However, a three-second delay before answering might be unacceptable if you’re building a voice-based customer service agent.
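One way to see where the time goes in a chain of calls is to time each step separately. The sketch below is a minimal Python example under stated assumptions: `fake_call` is a hypothetical stand-in that uses `time.sleep` to simulate LLM latency, and the step names are invented for illustration. In a real agent, each step would wrap an actual model call.

```python
import time


def timed_chain(steps, initial_input):
    """Run steps back to back; return the final output and per-step seconds."""
    timings = []
    value = initial_input
    for name, step in steps:
        start = time.perf_counter()
        value = step(value)  # each step waits for the previous one's output
        timings.append((name, time.perf_counter() - start))
    return value, timings


def fake_call(delay, suffix):
    """Hypothetical stand-in for an LLM call; sleeps to simulate latency."""
    def call(text):
        time.sleep(delay)
        return f"{text} -> {suffix}"
    return call


steps = [
    ("detect", fake_call(0.02, "detected")),
    ("translate", fake_call(0.02, "translated")),
    ("review", fake_call(0.02, "reviewed")),
]

result, timings = timed_chain(steps, "Hello")
total = sum(seconds for _, seconds in timings)

for name, seconds in timings:
    print(f"{name}: {seconds:.3f}s")
print(f"total: {total:.3f}s")
```

Because the steps run sequentially, the total is roughly the sum of the individual latencies, which makes it easier to spot the step worth removing, merging, or speeding up.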
Resources
Resources for an AI agent largely refer to the number of tokens a task uses. More tokens mean more money. The cost per million tokens is decreasing, but it can still be significant for certain applications. That means you don’t want to waste tokens unnecessarily.
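To get a feel for how token usage turns into money, here is a small Python sketch. All of the numbers are hypothetical: the token counts, the per-million prices, and the task volume are placeholders, and `estimate_cost` is an illustrative helper rather than any provider's API. It assumes input and output tokens are priced separately, which is common but worth checking against your provider's pricing page.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one task from token counts and per-million prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical figures: 2,000 input and 500 output tokens per task,
# priced at 3.00 and 15.00 per million tokens respectively.
cost_per_task = estimate_cost(2_000, 500, 3.00, 15.00)

# Hypothetical volume of 100,000 tasks per month.
monthly_cost = cost_per_task * 100_000

print(f"per task: {cost_per_task:.4f}, per month: {monthly_cost:.2f}")
```

A cost that looks negligible per task can add up at volume, which is why trimming unnecessary tokens from prompts and responses is worth the effort.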