Why the RCS AI Agent Testing Gap Is the Next Major Enterprise Opportunity

Every RCS AI agent demo works. Almost no production system works the same way two weeks after launch.

The MCP SDK has crossed 97 million monthly downloads. There are 5,800+ MCP servers in the wild. Enterprise teams are building RCS AI agents, launching them, and watching them silently degrade in production — not because the AI model failed, but because the testing infrastructure was never there to begin with.

This isn't a model problem. It's a testing infrastructure problem. And it's the next major opportunity for enterprise teams who get ahead of it.

Why Standard Chatbot Testing Fails RCS

If you've tested a standard chatbot, you know the drill: send a message, check the response, validate the JSON, mark it passing. Tools like Botium or even homemade curl scripts can cover most of the flow in an afternoon.

RCS doesn't work that way. The channel has a four-axis render matrix that standard testing tools can't see:

  • Device OEM — Samsung, Google Pixel, Xiaomi, Motorola all render Rich Cards differently
  • Carrier profile — carrier-specific firmware affects how suggested actions display
  • UP level — Universal Profile version determines which RCS features are available (UP 1.0 vs 2.0 vs 3.0)
  • OS version — Android version changes layout behavior for carousels and chip bars

SMS/MMS testing tools don't capture any of this. You can validate that your API returns a correctly structured rich card message and still ship a broken experience — because on a Samsung Galaxy S24 running Android 15 with carrier firmware XL-2024, the card renders with the wrong aspect ratio and your CTA button disappears.

This is the "silent failure" problem: your monitoring shows delivery confirmed, but users see a broken layout. You don't find out until tickets start coming in.

Standard QA doesn't have a category for this. Device-level render validation — automated screenshot comparison across a hardware device matrix — isn't part of any off-the-shelf RCS testing workflow. It's the missing layer.
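As a rough illustration, the four axes above can be enumerated into explicit test targets. The Python sketch below uses placeholder axis values drawn from the examples in this post; the carrier profile names and the `RenderTarget` type are hypothetical, and a real matrix would be pruned to the combinations that actually carry your traffic.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class RenderTarget:
    """One cell of the four-axis render matrix."""
    oem: str
    carrier_profile: str
    up_level: str
    os_version: str


# Placeholder axis values; replace with the combinations your traffic uses.
OEMS = ["Samsung", "Google Pixel", "Xiaomi", "Motorola"]
CARRIER_PROFILES = ["carrier-a", "carrier-b"]  # hypothetical profile names
UP_LEVELS = ["UP 1.0", "UP 2.0", "UP 3.0"]
OS_VERSIONS = ["Android 14", "Android 15"]


def build_render_matrix() -> List[RenderTarget]:
    """Return every combination of the four axes as a test target."""
    return [
        RenderTarget(oem, carrier, up, os_version)
        for oem, carrier, up, os_version in product(
            OEMS, CARRIER_PROFILES, UP_LEVELS, OS_VERSIONS
        )
    ]


matrix = build_render_matrix()
print(len(matrix))  # 4 * 2 * 3 * 2 = 48 combinations
```

Even these small placeholder lists produce 48 combinations, which is why a practical matrix is a curated subset rather than the full product.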

MCP Tool Descriptions Are Your Agent's Real Code

Here's the part nobody talks about in AI agent planning sessions: the MCP tool descriptions determine your agent's production behavior more than the model does.

MCP tool descriptions are the metadata that tells your AI model what each tool does, when to call it, and how to interpret the response. They're treated as boilerplate. They're written once at project start and forgotten.

But here's what Apigene's research found: bad MCP tool descriptions cause a 23% misroute rate — meaning nearly one in four tool calls goes to the wrong handler. After rewriting the tool descriptions with better specificity and channel-awareness, that rate dropped to under 5%.

That's roughly a fivefold reduction in misroutes. Not from changing the model. From changing the documentation.

The token bloat problem compounds this. Loading 10+ MCP servers with verbose tool definitions can consume 30–50% of your context window before the first user message arrives. Your expensive model is spending most of its context budget reading about tools that will never be called in this session.

For RCS specifically, this matters enormously. Suggested actions, carousels, Rich Cards — these are all MCP tool outputs. When your tool descriptions are vague, the model doesn't just misroute infrastructure commands. It generates wrong Rich Card layouts, suggests the wrong actions, and builds broken conversation flows that look fine in your development environment.

The fix is straightforward but almost never implemented: audit your MCP tool descriptions on the same schedule as your model updates. Test routing accuracy as a first-class metric. Treat tool descriptions as production code, not documentation.
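To make "vague" versus "specific" concrete, here is a minimal sketch of two tool definitions in the general shape MCP uses (a name, a description, and a JSON Schema for inputs). Both tools are hypothetical, the limits of four actions and 25 characters are illustrative placeholders rather than quoted channel limits, and `undocumented_params` is just one simple check an audit might start with.

```python
# Both dicts follow the general shape of an MCP tool definition
# (name, description, inputSchema). The tools themselves are hypothetical.
VAGUE_TOOL = {
    "name": "send_card",
    "description": "Sends a card.",
    "inputSchema": {
        "type": "object",
        "properties": {"data": {"type": "object"}},
    },
}

SPECIFIC_TOOL = {
    "name": "send_rich_card",
    "description": (
        "Send a single RCS Rich Card with a title, body text, and optional "
        "suggested actions. Use this for exactly one item; for several "
        "items, use a carousel tool instead. Keep suggested action labels "
        "within the channel's character limit."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Card title."},
            "body": {"type": "string", "description": "Card body text."},
            "suggested_actions": {
                "type": "array",
                "description": "Labels for tappable suggestions.",
                "maxItems": 4,  # illustrative limit
                "items": {"type": "string", "maxLength": 25},  # illustrative limit
            },
        },
        "required": ["title", "body"],
    },
}


def undocumented_params(tool: dict) -> list:
    """Return parameter names whose schema entry has no description."""
    props = tool["inputSchema"].get("properties", {})
    return [name for name, schema in props.items() if "description" not in schema]


print(undocumented_params(VAGUE_TOOL))     # ['data']
print(undocumented_params(SPECIFIC_TOOL))  # []
```

A check like this only covers completeness. Whether a description says clearly when to call the tool still has to be judged by measuring routing accuracy.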

The Load Balancer Problem — Session State at Scale

MCP-based RCS agents are session-oriented by design. A conversation between a user and an AI agent has state — context that carries across turns, decisions that reference earlier steps, a running thread of intent.

Production infrastructure doesn't like state.

Most reverse proxies, load balancers, and API gateways in a standard enterprise deployment are configured on the assumption of stateless request handling. A request comes in, gets routed, and a response goes out. Nothing is preserved between requests. This works fine for REST APIs. It fights you relentlessly when you're trying to maintain a session across multiple turns of an AI conversation.

SSE (Server-Sent Events) was the original MCP transport for remote servers, and it depends on long-lived connections. Load balancers and proxies such as AWS ALB, Nginx, and Cloudflare apply idle timeouts to connections that go quiet, and a pause in a conversation can easily outlast them. The result: the agent silently disconnects mid-conversation, session state is lost, and the user starts over.

Streamable HTTP is the current transport recommendation, but it doesn't fully solve the session affinity problem. Teams running MCP at scale have converged on three patterns:

  1. Sticky sessions — route all requests from a given session to the same server instance. Works, but loses horizontal scalability benefits.
  2. Dedicated instance routing — reserve a server instance for session-bound traffic. Better reliability, higher cost.
  3. Stateless session reconstruction — store session state in an external store (Redis, DynamoDB) and reconstruct on each request. Most scalable, most complex.
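The third pattern can be sketched in a few lines. This is a minimal Python illustration, not a production design: `InMemoryStore` is a stand-in for an external store such as Redis or DynamoDB, and `AgentInstance`, the key format, and the state layout are all assumptions made for the example.

```python
import json
from typing import Dict, Optional


class InMemoryStore:
    """Stand-in for an external store such as Redis or DynamoDB."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class AgentInstance:
    """A server instance that keeps no conversation state of its own."""

    def __init__(self, name: str, store: InMemoryStore) -> None:
        self.name = name
        self.store = store

    def handle_turn(self, session_id: str, user_message: str) -> dict:
        # Rebuild session state from the shared store on every request.
        raw = self.store.get(f"session:{session_id}")
        state = json.loads(raw) if raw else {"turns": []}
        state["turns"].append(user_message)
        # Persist before replying so another instance can pick up the thread.
        self.store.set(f"session:{session_id}", json.dumps(state))
        return {"handled_by": self.name, "turn_count": len(state["turns"])}


store = InMemoryStore()
instance_a = AgentInstance("instance-a", store)
instance_b = AgentInstance("instance-b", store)

instance_a.handle_turn("s1", "I want to change my delivery date")
reply = instance_b.handle_turn("s1", "Make it Friday")
print(reply)  # {'handled_by': 'instance-b', 'turn_count': 2}
```

The second turn lands on a different instance and still sees the first. Note that the read-modify-write here is not atomic; with a real store, concurrent turns on the same session would need locking or a conditional write.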

JWKS caching addresses a related per-request cost. Token validation without caching takes 50–200ms per request, because the signing keys are fetched every time. With JWKS caching, that drops to under 1ms. For a session that spans 15–20 turns of conversation, that removes roughly one to four seconds of accumulated latency.
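A minimal sketch of the caching idea, in Python: fetch the key set once and reuse it until a time-to-live expires. `JwksCache` and `fake_fetch_jwks` are hypothetical names, the five-minute TTL is an arbitrary assumption, and the fake fetch stands in for an HTTP request to the identity provider. Signature verification itself is out of scope here.

```python
import time
from typing import Callable, Dict, List, Optional


class JwksCache:
    """Cache a fetched key set for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[], Dict],
        ttl_seconds: float = 300.0,  # assumed TTL, tune to your key rotation
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Optional[Dict] = None
        self._expires_at = 0.0

    def get_keys(self) -> Dict:
        now = self._clock()
        # Fetch only on first use or after the cached copy has expired.
        if self._keys is None or now >= self._expires_at:
            self._keys = self._fetch()
            self._expires_at = now + self._ttl
        return self._keys


fetch_calls: List[float] = []


def fake_fetch_jwks() -> Dict:
    """Hypothetical stand-in for an HTTP GET to the provider's JWKS URL."""
    fetch_calls.append(time.monotonic())
    return {"keys": [{"kid": "demo-key", "kty": "RSA"}]}


cache = JwksCache(fake_fetch_jwks)
for _ in range(20):  # one key lookup per turn in a 20-turn conversation
    cache.get_keys()
print(len(fetch_calls))  # 1
```

Twenty turns trigger a single fetch. At the 50–200ms uncached cost quoted above, the other nineteen lookups are where the saved seconds come from.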

For RCS agents specifically, session continuity matters more than on most channels. Users who are mid-conversation with your brand agent expect the thread to persist. An interruption mid-conversation, such as "I'm sorry, I didn't understand that" from a fresh session, breaks trust more severely than a delayed first response.

The Three-Pillar RCS AI Agent Testing Framework

Teams that have solved this don't have better AI models. They have better testing infrastructure. Here's the framework they use:

Pillar 1: Device-Level Render Validation

Hardware device labs are a prerequisite, not a luxury. A minimum viable device matrix for RCS covers five to eight key combinations: Samsung (flagship + mid-range), Google Pixel (latest + one generation back), Xiaomi or Motorola for broader Android coverage, and carrier-specific profiles where you have traffic.

Automated screenshot comparison catches render issues before they ship. Tools like Applitools or even custom Playwright setups can capture Rich Card and suggested action renders across your device matrix and flag regressions automatically.

The key shift: device validation is part of your CI pipeline, not a pre-launch checklist. Every commit that changes message templates runs the render matrix before merge.
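The comparison step at the heart of such a CI check can be sketched without any imaging library. The Python below treats a screenshot as rows of grayscale values and reports the fraction of pixels that changed; the `tolerance` and `max_ratio` defaults are arbitrary assumptions, and a real pipeline would load actual device screenshots and likely use a dedicated visual-diff tool.

```python
from typing import List

Pixels = List[List[int]]  # rows of grayscale values, 0-255


def diff_ratio(baseline: Pixels, candidate: Pixels, tolerance: int = 8) -> float:
    """Fraction of pixels whose brightness differs by more than `tolerance`."""
    same_shape = len(baseline) == len(candidate) and all(
        len(b) == len(c) for b, c in zip(baseline, candidate)
    )
    if not same_shape:
        return 1.0  # a size change counts as a full mismatch
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for b_row, c_row in zip(baseline, candidate)
        for b, c in zip(b_row, c_row)
        if abs(b - c) > tolerance
    )
    return changed / total


def render_regressed(baseline: Pixels, candidate: Pixels, max_ratio: float = 0.01) -> bool:
    """True when more than `max_ratio` of the pixels changed."""
    return diff_ratio(baseline, candidate) > max_ratio


baseline = [[255] * 10 for _ in range(10)]
candidate = [row[:] for row in baseline]
for x in range(10):
    candidate[9][x] = 0  # simulate a bottom row (say, a CTA) that failed to render

print(diff_ratio(baseline, candidate))        # 0.1
print(render_regressed(baseline, candidate))  # True
```

In CI, a check like this would run once per device in the matrix and fail the merge when any target regresses.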

Pillar 2: MCP Tool Description Audit

Build a quality rubric for your tool descriptions: specificity (does it describe exactly when to call this tool?), completeness (are all parameters and return types documented?), and channel-awareness (does it account for RCS-specific constraints like character limits on suggested actions?).

Measure misroute rate as a first-class metric. Run A/B tests on tool descriptions and track routing accuracy over a two-week window. If your current misroute rate is above 5%, tool description optimization is your highest-leverage improvement.
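Misroute rate is simple to compute once you have labeled cases: prompts paired with the tool that should have been called and the tool that actually was. A minimal Python sketch, with hypothetical tool names and a `RoutingCase` type invented for the example:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RoutingCase:
    prompt: str
    expected_tool: str
    called_tool: str


def misroute_rate(cases: Iterable[RoutingCase]) -> float:
    """Share of cases where the model called a tool other than the expected one."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("no routing cases to score")
    misses = sum(1 for c in case_list if c.called_tool != c.expected_tool)
    return misses / len(case_list)


# Hypothetical labeled results from one evaluation run.
cases = [
    RoutingCase("Show me three sofas", "send_carousel", "send_carousel"),
    RoutingCase("Where is my order?", "lookup_order", "lookup_order"),
    RoutingCase("Show me this lamp", "send_rich_card", "send_carousel"),
    RoutingCase("Talk to a person", "handoff_to_agent", "handoff_to_agent"),
]

rate = misroute_rate(cases)
print(f"{rate:.0%}")  # 25%
print(rate > 0.05)    # True
```

Running the same labeled set before and after a description rewrite gives a direct comparison of the two versions.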

Token budget optimization: profile your MCP server load and measure what percentage of context is consumed by tool definitions. If it's above 20%, you've got room to compress.
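A first-pass estimate of that percentage needs nothing more than the serialized tool definitions and the context window size. The Python sketch below assumes roughly four characters per token, which is only a rule of thumb; for real numbers, count with the tokenizer of the model you deploy. The function names and the sample tool are hypothetical.

```python
import json
from typing import List


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; swap in the model's tokenizer for real counts."""
    return int(len(text) / chars_per_token)


def tool_context_share(tools: List[dict], context_window_tokens: int) -> float:
    """Estimated fraction of the context window used by tool definitions."""
    serialized = json.dumps(tools)
    return estimate_tokens(serialized) / context_window_tokens


def over_budget(share: float, threshold: float = 0.20) -> bool:
    """True when tool definitions exceed the 20% guideline."""
    return share > threshold


# Hypothetical tool list; in practice, collect definitions from every loaded server.
tools = [
    {
        "name": "send_rich_card",
        "description": "Send a single RCS Rich Card with a title and body.",
        "inputSchema": {"type": "object", "properties": {}},
    }
]

share = tool_context_share(tools, context_window_tokens=8_000)
print(f"{share:.1%}", over_budget(share))
```

Tracking this number per release makes it obvious when a newly added server pushes tool definitions past the budget.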

Pillar 3: Production-Scale Session Testing

Run concurrency tests at realistic load levels before every major deployment, with headroom above your target. If you're targeting 10,000 concurrent conversations, test at 12,000 before launch.
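The shape of such a test can be sketched with Python's `asyncio`. Everything here is a stand-in: `fake_conversation` only yields to the event loop where a real harness would send requests to staging through the production load balancer configuration, and the 20% headroom mirrors the 10,000-to-12,000 example above.

```python
import asyncio


async def fake_conversation(conversation_id: int, turns: int = 3) -> bool:
    """Hypothetical stand-in for driving one conversation against staging."""
    for _ in range(turns):
        await asyncio.sleep(0)  # placeholder for a real request/response round trip
    return True


async def run_load(target_concurrency: int, headroom_percent: int = 20) -> dict:
    """Start target + headroom conversations at once and count failures."""
    total = target_concurrency * (100 + headroom_percent) // 100
    results = await asyncio.gather(
        *(fake_conversation(i) for i in range(total)),
        return_exceptions=True,  # collect errors instead of aborting the run
    )
    failures = sum(1 for r in results if r is not True)
    return {"started": total, "failed": failures}


summary = asyncio.run(run_load(10_000))
print(summary)  # {'started': 12000, 'failed': 0}
```

With a real client in place of the stub, the failure count and per-turn latency at the headroom level are the numbers worth gating a deployment on.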

Session persistence validation: kill a server instance mid-conversation and verify that session state reconstructs correctly on the next request. If it doesn't, you've found a state-persistence gap that would otherwise surface only in production.

Load balancer compatibility testing: run staging with the same ALB, Nginx, or Cloudflare configuration you'll run in production. Configuration differences between environments are the most common cause of session issues that never appear in development.

What Winning Teams Do Differently

The enterprises that have solved the RCS AI agent testing gap share five habits:

  1. Treat testing infrastructure as a production system — not a QA department responsibility. The teams with the best outcomes have engineers dedicated to the testing platform itself, not just engineers using it.

  2. Run device-level validation in CI, not pre-launch — every change that touches message structure or suggested actions triggers the full render matrix automatically.

  3. Audit MCP tool descriptions on the same schedule as model updates — it's the most cost-effective optimization most teams are leaving on the table.

  4. Test at realistic concurrency before every production deployment — find out if your load balancer configuration works at scale in staging, not on a Friday night.

  5. Measure session continuity as a first-class metric — not just "did the message deliver?" but "did the conversation stay intact across 20 turns?"

One enterprise team we worked with reduced their RCS agent debug cycles from six weeks to four days after building the three-pillar testing framework. The AI model didn't change. The testing infrastructure did.

The Opportunity — Testing Infrastructure as Competitive Moat

RCS adoption is accelerating. Infobip's 2026 data shows 3x RCS growth. North America saw 70x traffic growth year-over-year. The channel is crossing the early-adopter chasm into mainstream enterprise.

When a channel crosses into mainstream, the teams that win aren't the ones with the best AI models. They're the ones with the best operational infrastructure — the teams that can ship faster, debug faster, and iterate faster because their testing foundation is solid.

The A2P 10DLC trajectory is instructive. The teams that invested in compliance infrastructure early — carrier registration, brand verification, campaign testing pipelines — captured disproportionate market share as the ecosystem scaled. RCS is at the 2020 moment in that trajectory.

RCS X is built to address the testing gap that generic chatbot platforms ignore. The RCS agent testing infrastructure you're building today is the same foundation that lets you move faster than competitors who are still running demo-grade QA.

The window to build this moat is open now. The teams that start building their device matrix, their tool description audit process, and their session testing infrastructure today will be the ones who set the standard for enterprise RCS quality in 2027.

Evaluate your current RCS AI agent testing approach against the three-pillar framework. If you're missing any of the three pillars, that's your highest-leverage investment.

