A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

In this tutorial, we develop a comprehensive benchmarking framework to evaluate different types of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how rule-based, LLM-powered, and hybrid agents perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent's strengths and trade-offs in enterprise environments.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class Task:
   id: str
   name: str
   description: str
   category: str
   complexity: int
   expected_output: Any


@dataclass
class BenchmarkResult:
   task_id: str
   agent_name: str
   success: bool
   execution_time: float
   accuracy: float
   error_message: str = ""


class EnterpriseTaskSuite:
   def __init__(self):
       self.tasks = self._create_tasks()


   def _create_tasks(self) -> List[Task]:
       return [
           Task("data_transform", "CSV Data Transformation",
                "Transform customer data by aggregating sales", "data_processing", 3,
                {"total_sales": 15000, "avg_order": 750}),
           Task("api_integration", "REST API Integration",
                "Parse API response and extract key metrics", "integration", 2,
                {"status": "success", "active_users": 1250}),
           Task("workflow_automation", "Multi-Step Workflow",
                "Execute data validation -> processing -> reporting", "automation", 4,
                {"validated": True, "processed": 100, "report_generated": True}),
           Task("error_handling", "Error Recovery",
                "Handle malformed data gracefully", "reliability", 3,
                {"errors_caught": 5, "recovery_success": True}),
           Task("optimization", "Query Optimization",
                "Optimize database query performance", "performance", 5,
                {"execution_time_ms": 45, "rows_scanned": 1000}),
           Task("data_validation", "Schema Validation",
                "Validate data against business rules", "validation", 2,
                {"valid_records": 95, "invalid_records": 5}),
           Task("reporting", "Executive Dashboard",
                "Generate KPI summary report", "analytics", 3,
                {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
           Task("integration_test", "System Integration",
                "Test end-to-end integration flow", "testing", 4,
                {"all_systems_connected": True, "latency_ms": 120}),
       ]


   def get_task(self, task_id: str) -> Task:
       return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds multiple enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks.
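As a quick sanity check (a minimal sketch that uses only the classes defined above), we can instantiate the suite and inspect a single task before wiring up any agents:

# Instantiate the task suite and inspect one task.
suite = EnterpriseTaskSuite()
task = suite.get_task("data_transform")
print(len(suite.tasks), "tasks in the suite")       # 8 tasks in the suite
print(task.name, "| complexity:", task.complexity)  # CSV Data Transformation | complexity: 3
print(task.expected_output)                         # {'total_sales': 15000, 'avg_order': 750}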

class BaseAgent:
   def __init__(self, name: str):
       self.name = name


   def execute(self, task: Task) -> Dict[str, Any]:
       raise NotImplementedError


class RuleBasedAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.1, 0.3))
       if task.category == "data_processing":
           return {"total_sales": 15000 + random.randint(-500, 500),
                   "avg_order": 750 + random.randint(-50, 50)}
       elif task.category == "integration":
           return {"status": "success", "active_users": 1250}
       elif task.category == "automation":
           return {"validated": True, "processed": 98, "report_generated": True}
       else:
           return task.expected_output

We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents.
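As a small usage sketch (again relying only on the code above), we can run the rule-based agent on one task outside the benchmark loop and inspect its raw output:

# Run the rule-based agent once on a single task and look at the raw dictionary it returns.
suite = EnterpriseTaskSuite()
agent = RuleBasedAgent("Rule-Based Agent")
print(agent.execute(suite.get_task("api_integration")))  # {'status': 'success', 'active_users': 1250}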

class LLMAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.2, 0.5))
       accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
       result = {}
       for key, value in task.expected_output.items():
           if isinstance(value, (int, float)):
               variation = value * (1 - accuracy_boost)
               result[key] = value + random.uniform(-variation, variation)
           else:
               result[key] = value
       return result


class HybridAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.15, 0.35))
       if task.complexity <= 2:
           return task.expected_output
       else:
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)):
                   variation = value * 0.03
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result

We develop two intelligent agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows.
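The LLMAgent above simulates model behavior with controlled noise rather than calling a real model. To plug an actual LLM into the same harness, we would keep the BaseAgent interface and swap the body of execute for a real call. The sketch below is purely hypothetical: call_llm is a placeholder for whichever client you use (it is not defined anywhere in this tutorial), and it assumes the model replies with valid JSON.

# Hypothetical sketch: call_llm stands in for your own LLM client and is not defined in this tutorial.
class RealLLMAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        prompt = (f"Task: {task.description}\n"
                  f"Return only a JSON object with the keys {list(task.expected_output.keys())}.")
        raw = call_llm(prompt)   # placeholder: swap in an actual API or local-model call here
        return json.loads(raw)   # assumes valid JSON; json is already imported at the top of the script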

class BenchmarkEngine:
   def __init__(self, task_suite: EnterpriseTaskSuite):
       self.task_suite = task_suite
       self.results: List[BenchmarkResult] = []


   def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
       print(f"n{'='*60}")
       print(f"Benchmarking Agent: {agent.name}")
       print(f"{'='*60}")
       for task in self.task_suite.tasks:
           print(f"nTask: {task.name} (Complexity: {task.complexity}/5)")
           for i in range(iterations):
               result = self._execute_task(agent, task, i+1)
               self.results.append(result)
               status = "✓ PASS" if result.success else "✗ FAIL"
               print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log results, and measure key metrics such as execution time and accuracy. This creates a systematic and repeatable benchmarking loop.
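Because the engine simply iterates over whatever the suite contains, extending the benchmark does not require touching the engine at all. A minimal sketch (the extra task shown here is illustrative and not part of the original suite):

# Append a custom task to the suite; it will be benchmarked alongside the original eight.
suite = EnterpriseTaskSuite()
suite.tasks.append(Task("log_parsing", "Log Error Extraction",
                        "Count error and warning lines in application logs", "reliability", 2,
                        {"errors_found": 12, "recovery_success": True}))
print([t.id for t in suite.tasks])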

   def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
       start_time = time.time()
       try:
           output = agent.execute(task)
           execution_time = time.time() - start_time
           accuracy = self._calculate_accuracy(output, task.expected_output)
           success = accuracy >= 0.85
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                  execution_time=execution_time, accuracy=accuracy)
       except Exception as e:
           execution_time = time.time() - start_time
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                  execution_time=execution_time, accuracy=0.0, error_message=str(e))


   def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
       if not output:
           return 0.0
       scores = []
       for key, expected_val in expected.items():
           if key not in output:
               scores.append(0.0)
               continue
           actual_val = output[key]
           if isinstance(expected_val, bool):
               scores.append(1.0 if actual_val == expected_val else 0.0)
           elif isinstance(expected_val, (int, float)):
               diff = abs(actual_val - expected_val)
               tolerance = abs(expected_val * 0.1)
               score = max(0, 1 - (diff / (tolerance + 1e-9)))
               scores.append(score)
           else:
               scores.append(1.0 if actual_val == expected_val else 0.0)
       return np.mean(scores) if scores else 0.0

We define the task execution logic and the accuracy computation. We measure each agent's performance by comparing its output against the expected results using a scoring mechanism. This step ensures our benchmarking process is quantitative and fair, providing insight into how closely agents align with business expectations.
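For concreteness, take the data-transformation task as a hand-worked example (the numbers here are illustrative, not output from the script): the expected total_sales is 15000, so the 10% tolerance is 1500. An agent that returns 14700 is off by 300 and scores 1 - 300/1500 = 0.80 on that field; if avg_order matches exactly, that field scores 1.00, so the overall accuracy is the mean, 0.90, which clears the 0.85 success threshold.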

   def generate_report(self):
       df = pd.DataFrame([asdict(r) for r in self.results])
       print(f"n{'='*60}")
       print("BENCHMARK REPORT")
       print(f"{'='*60}n")
       for agent_name in df['agent_name'].unique():
           agent_df = df[df['agent_name'] == agent_name]
           print(f"{agent_name}:")
           print(f"  Success Rate: {agent_df['success'].mean():.1%}")
           print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
           print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}n")
       return df


   def visualize_results(self, df: pd.DataFrame):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')
       success_rate = df.groupby('agent_name')['success'].mean()
       axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 0].set_title('Success Rate by Agent', fontweight='bold')
       axes[0, 0].set_ylabel('Success Rate')
       axes[0, 0].set_ylim(0, 1.1)
       for i, v in enumerate(success_rate.values):
           axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')
       time_data = df.groupby('agent_name')['execution_time'].mean()
       axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 1].set_title('Average Execution Time', fontweight='bold')
       axes[0, 1].set_ylabel('Time (seconds)')
       for i, v in enumerate(time_data.values):
           axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha='center', fontweight='bold')
       df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
       axes[1, 0].set_title('Accuracy Distribution', fontweight='bold')
       axes[1, 0].set_xlabel('Agent')
       axes[1, 0].set_ylabel('Accuracy')
       plt.sca(axes[1, 0])
       plt.xticks(rotation=15)
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')  # re-set the title that pandas' boxplot(by=...) overwrites
       task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
       df['complexity'] = df['task_id'].map(task_complexity)
       complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
       complexity_perf.plot(kind='line', ax=axes[1, 1], marker='o', linewidth=2)
       axes[1, 1].set_title('Accuracy by Task Complexity', fontweight='bold')
       axes[1, 1].set_xlabel('Task Complexity')
       axes[1, 1].set_ylabel('Accuracy')
       axes[1, 1].legend(title='Agent', loc='best')
       axes[1, 1].grid(True, alpha=0.3)
       plt.tight_layout()
       plt.show()


if __name__ == "__main__":
   print("Enterprise Software Benchmarking for Agentic Agents")
   print("="*60)
   task_suite = EnterpriseTaskSuite()
   benchmark = BenchmarkEngine(task_suite)
   agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
   for agent in agents:
       benchmark.run_benchmark(agent, iterations=3)
   results_df = benchmark.generate_report()
   benchmark.visualize_results(results_df)
   results_df.to_csv('agent_benchmark_results.csv', index=False)
   print("nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.
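Because the results land in a plain CSV file, any follow-up analysis can be done with standard pandas. A small sketch (assuming the script above has already written agent_benchmark_results.csv):

# Reload the exported results and build a per-task summary table.
import pandas as pd

df = pd.read_csv('agent_benchmark_results.csv')
summary = (df.groupby(['agent_name', 'task_id'])
             .agg(success_rate=('success', 'mean'),
                  avg_accuracy=('accuracy', 'mean'),
                  avg_time_s=('execution_time', 'mean'))
             .round(3))
print(summary)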

In conclusion, we implemented a robust, extensible benchmarking system that enables us to measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process not only lets us evaluate existing agents but also provides a strong foundation for building next-generation enterprise AI agents optimized for reliability and intelligence.

