How to Build a Vision-Guided Web AI Agent with MolmoWeb-4B Using Multimodal Reasoning and Action Prediction

In this tutorial, we explore MolmoWeb, Ai2’s open multimodal web agent that understands and interacts with websites directly from screenshots, without relying on HTML or DOM parsing. We set up the full environment in Colab, load the MolmoWeb-4B model with efficient 4-bit quantization, and build the exact prompting workflow that lets the model reason about a web task and predict browser actions. We also test the model on blank pages, synthetic web screenshots, and multi-step browsing scenarios to understand how screenshot-based web agents actually think, act, and maintain context across steps.

print("=" * 70)
print("SECTION 1: Installing dependencies...")
print("=" * 70)


import subprocess, sys


def pip_install(*packages):
   subprocess.check_call(
       [sys.executable, "-m", "pip", "install", "-q"] + list(packages)
   )


pip_install(
   "transformers>=4.48.0",
   "accelerate",
   "bitsandbytes",
   "jinja2",
   "Pillow",
   "requests",
   "datasets",
   "matplotlib",
   "torch",
)


import torch
import re
import json
import textwrap
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
from jinja2 import Template
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig


print(f"PyTorch {torch.__version__}  |  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
   print(f"   GPU: {torch.cuda.get_device_name(0)}")
   mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
   print(f"   VRAM: {mem_gb:.1f} GB")




print("n" + "=" * 70)
print("SECTION 2: Loading MolmoWeb-4B model...")
print("=" * 70)


CHECKPOINT = "allenai/MolmoWeb-4B"


QUANTIZE = True


if QUANTIZE:
   print("Using 4-bit NF4 quantization (fits ~6 GB VRAM)")
   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
   )
   model = AutoModelForImageTextToText.from_pretrained(
       CHECKPOINT,
       trust_remote_code=True,
       quantization_config=bnb_config,
       device_map="auto",
   )
else:
   print("Loading in full bfloat16 precision")
   model = AutoModelForImageTextToText.from_pretrained(
       CHECKPOINT,
       trust_remote_code=True,
       torch_dtype=torch.bfloat16,
       device_map="auto",
   )


processor = AutoProcessor.from_pretrained(
   CHECKPOINT,
   trust_remote_code=True,
   padding_side="left",
)


print(f"Model loaded: {CHECKPOINT}")
print(f"   Device map: {model.hf_device_map if hasattr(model, 'hf_device_map') else 'single device'}")

We set up the entire environment by installing the required dependencies, importing the core libraries, and verifying CUDA availability and GPU details. We then load the MolmoWeb-4B model with 4-bit NF4 quantization via BitsAndBytes so it fits comfortably on a free-tier GPU, and initialize the processor that prepares both text prompts and screenshot inputs. By the end of this step, we have a stable foundation for running MolmoWeb efficiently in Colab.
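As a rough sanity check on why the 4-bit path fits a small GPU, we can estimate the quantized weight footprint. The arithmetic below is a back-of-envelope sketch under stated assumptions (4B parameters, 4 bits per NF4 weight, 16 bits per bf16 weight), not a measured number:

```python
# Back-of-envelope VRAM estimate for a 4B-parameter model.
# Assumptions (not measured): 4e9 weights; NF4 stores 4 bits per weight,
# bf16 stores 16 bits; quantization constants, KV cache, and activations
# add further overhead on top of these figures.
N_PARAMS = 4e9
BYTES_PER_WEIGHT_NF4 = 0.5   # 4 bits
BYTES_PER_WEIGHT_BF16 = 2.0  # 16 bits

nf4_gb = N_PARAMS * BYTES_PER_WEIGHT_NF4 / 1e9
bf16_gb = N_PARAMS * BYTES_PER_WEIGHT_BF16 / 1e9

print(f"NF4 weights:  ~{nf4_gb:.1f} GB")   # ~2.0 GB
print(f"bf16 weights: ~{bf16_gb:.1f} GB")  # ~8.0 GB
```

With roughly 2 GB of weights plus runtime overhead, the "~6 GB VRAM" figure printed in the loading code is plausible for a T4-class GPU, while full bf16 would be much tighter.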

print("n" + "=" * 70)
print("SECTION 3: Understanding the prompt template & action space")
print("=" * 70)


MOLMOWEB_THINK_TEMPLATE = Template("""
# GOAL
{{ task_description }}


# PREVIOUS STEPS
{% for action in past_actions -%}
## Step {{ action['index'] }}
THOUGHT: {{ action['thought'] }}
ACTION: {{ action['action'] }}
{% endfor %}
# CURRENTLY ACTIVE PAGE
Page {{ page_index }}: {{ page_title }} | {{ page_url }}


# NEXT STEP


""")


SYSTEM_MESSAGE = "molmo_web_think"


print("""
MolmoWeb Action Space:
 goto(url)        - Navigate to a URL
 click(x, y)      - Click at normalised coordinates (0.0-1.0)
 type("text")     - Type text into focused element
 scroll(dir)      - Scroll the page (up/down)
 press("key")     - Press a key (Enter, Tab, etc.)
 new_tab()        - Open a new tab
 switch_tab(n)    - Switch to tab n
 go_back()        - Navigate back
 send_msg("text") - Reply to the user with an answer
""")




print("=" * 70)
print("SECTION 4: Defining helper functions")
print("=" * 70)




def build_prompt(task_description, past_actions=None, page_title=None,
                page_url="about:blank", page_index=0):
   """Build the full MolmoWeb prompt from components."""
   if past_actions is None:
       past_actions = []
   user_message = MOLMOWEB_THINK_TEMPLATE.render(
       task_description=task_description,
       past_actions=past_actions,
       page_title=page_title,
       page_url=page_url,
       page_index=page_index,
   )
   return f"{SYSTEM_MESSAGE}: {user_message}"




def run_inference(prompt, image, max_new_tokens=300):
   """Run a single forward pass through MolmoWeb and return decoded text."""
   messages = [
       {
           "role": "user",
           "content": [
               {"type": "text", "text": prompt},
               {"type": "image", "image": image},
           ],
       }
   ]
   inputs = processor.apply_chat_template(
       messages,
       tokenize=True,
       add_generation_prompt=True,
       return_tensors="pt",
       return_dict=True,
       padding=True,
   )
   inputs = {k: v.to(model.device) for k, v in inputs.items()}


   with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
       output = model.generate(**inputs, max_new_tokens=max_new_tokens)


   generated_tokens = output[0, inputs["input_ids"].size(1):]
   return processor.decode(generated_tokens, skip_special_tokens=True)




def parse_thought_and_action(raw_output):
   """
   Parse MolmoWeb output into thought and action components.


   MolmoWeb outputs typically look like:
       THOUGHT: I need to navigate to arxiv.org to find the paper.
       ACTION: goto("https://arxiv.org")


   Returns a dict with 'thought' and 'action' keys.
   """
   thought = ""
   action = ""


   thought_match = re.search(r"THOUGHT:\s*(.+?)(?=\nACTION:|\Z)", raw_output, re.DOTALL)
   action_match = re.search(r"ACTION:\s*(.+?)(?=\n|$)", raw_output, re.DOTALL)


   if thought_match:
       thought = thought_match.group(1).strip()
   if action_match:
       action = action_match.group(1).strip()


   if not thought and not action:
       lines = raw_output.strip().split("\n")
       if len(lines) >= 2:
           thought = lines[0].strip()
           action = lines[-1].strip()
       else:
           thought = raw_output.strip()


   return {"thought": thought, "action": action}

We define the structured prompt template and system message that guide the model’s reasoning, establishing how the task, past actions, and current page context are formatted before being sent to the model. We then build helper functions for prompt construction, model inference, and parsing raw completions into separate thought and action components, forming the core interface that lets MolmoWeb behave like a step-by-step web agent.
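To make the parsing contract concrete, here is the THOUGHT/ACTION split applied to a canned completion. The parser is restated inline so the snippet is self-contained, and the sample text is invented for illustration:

```python
import re

def split_thought_action(raw_output):
    """Split a 'THOUGHT: ... / ACTION: ...' completion into parts
    (same regexes as parse_thought_and_action above)."""
    thought_match = re.search(r"THOUGHT:\s*(.+?)(?=\nACTION:|\Z)", raw_output, re.DOTALL)
    action_match = re.search(r"ACTION:\s*(.+?)(?=\n|$)", raw_output, re.DOTALL)
    return {
        "thought": thought_match.group(1).strip() if thought_match else "",
        "action": action_match.group(1).strip() if action_match else "",
    }

sample = 'THOUGHT: I should open arxiv.org first.\nACTION: goto("https://arxiv.org")'
parsed = split_thought_action(sample)
print(parsed["thought"])  # I should open arxiv.org first.
print(parsed["action"])   # goto("https://arxiv.org")
```

The lookahead `(?=\nACTION:|\Z)` is what stops the thought capture right before the action line, so multi-sentence thoughts survive intact.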

def parse_click_coords(action_str):
   """
   Extract normalised (x, y) coordinates from a click action string.
   e.g., 'click(0.45, 0.32)' -> (0.45, 0.32)
   Returns None if the action is not a click.
   """
   match = re.search(r"click(s*([d.]+)s*,s*([d.]+)s*)", action_str)
   if match:
       return float(match.group(1)), float(match.group(2))
   return None




def parse_action_details(action_str):
   """
   Parse a MolmoWeb action string into a structured dict.
   Returns:  {"type": "click", "x": 0.45, "y": 0.32}
             {"type": "goto", "url": "https://..."}
             {"type": "type", "text": "query text"}
             {"type": "scroll", "direction": "down"}
             {"type": "press", "key": "Enter"}
             {"type": "send_msg", "message": "The answer is ..."}
             {"type": "unknown", "raw": "..."}
   """
   action_str = action_str.strip()


   m = re.match(r'click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)', action_str)
   if m:
       return {"type": "click", "x": float(m.group(1)), "y": float(m.group(2))}


   m = re.match(r'goto\(\s*["\'](.+?)["\']\s*\)', action_str)
   if m:
       return {"type": "goto", "url": m.group(1)}


   m = re.match(r'type\(\s*["\'](.+?)["\']\s*\)', action_str)
   if m:
       return {"type": "type", "text": m.group(1)}


   m = re.match(r'scroll\(\s*["\']?(up|down)["\']?\s*\)', action_str)
   if m:
       return {"type": "scroll", "direction": m.group(1)}


   m = re.match(r'press\(\s*["\'](.+?)["\']\s*\)', action_str)
   if m:
       return {"type": "press", "key": m.group(1)}


   m = re.match(r'send_msg\(\s*["\'](.+?)["\']\s*\)', action_str, re.DOTALL)
   if m:
       return {"type": "send_msg", "message": m.group(1)}


   m = re.match(r'(new_tab|go_back|switch_tab)\(\s*(\d*)\s*\)', action_str)
   if m:
       result = {"type": m.group(1)}
       if m.group(2):
           result["tab"] = int(m.group(2))
       return result


   return {"type": "unknown", "raw": action_str}




def visualise_click(image, action_str, title="MolmoWeb Prediction"):
   """
   Draw the predicted click location on the screenshot and display it.
   Coordinates are normalised (0-1); we convert to pixel space.
   """
   coords = parse_click_coords(action_str)


   fig, ax = plt.subplots(1, 1, figsize=(12, 7))
   ax.imshow(image)
   ax.set_title(title, fontsize=14)


   if coords:
       x_norm, y_norm = coords
       w, h = image.size
       x_px, y_px = x_norm * w, y_norm * h


       circle = patches.Circle(
           (x_px, y_px), radius=18, linewidth=3,
           edgecolor="red", facecolor="none"
       )
       ax.add_patch(circle)
       ax.plot(x_px, y_px, "r+", markersize=20, markeredgewidth=3)


       ax.annotate(
           f"click({x_norm:.3f}, {y_norm:.3f})",
           (x_px, y_px), xytext=(x_px + 25, y_px - 25),
           fontsize=11, color="white",
           bbox=dict(boxstyle="round,pad=0.3", facecolor="red", alpha=0.8),
           arrowprops=dict(arrowstyle="->", color="red", lw=2),
       )
   else:
       ax.text(
           0.5, 0.02, f"Action: {action_str}", transform=ax.transAxes,
           fontsize=12, ha="center", color="white",
           bbox=dict(boxstyle="round,pad=0.4", facecolor="blue", alpha=0.8),
       )


   ax.axis("off")
   plt.tight_layout()
   plt.show()




def download_image(url, size=(1280, 720)):
   """Download an image from a URL and resize to browser viewport dimensions."""
   response = requests.get(url, timeout=15)
   img = Image.open(BytesIO(response.content)).convert("RGB")
   img = img.resize(size, Image.LANCZOS)
   return img




def create_synthetic_webpage(title="Example Page", elements=None):
   """
   Create a synthetic webpage screenshot for testing.
   'elements' is a list of dicts: {"type": "button"|"input"|"text"|"link",
                                    "text": str, "pos": (x, y)}
   """
   img = Image.new("RGB", (1280, 720), color=(255, 255, 255))
   draw = ImageDraw.Draw(img)


   draw.rectangle([0, 0, 1280, 50], fill=(240, 240, 240))
   draw.rectangle([180, 10, 900, 40], outline=(200, 200, 200), width=1, fill="white")
   draw.text((200, 16), "https://www.example.com", fill=(100, 100, 100))


   for cx in [30, 60, 90]:
       draw.ellipse([cx - 8, 17, cx + 8, 33], fill=(200, 200, 200))


   draw.text((50, 70), title, fill="black")


   if elements:
       for el in elements:
           x, y = el["pos"]
           if el["type"] == "button":
               draw.rectangle([x, y, x + 150, y + 35], fill=(66, 133, 244))
               draw.text((x + 10, y + 8), el["text"], fill="white")
           elif el["type"] == "input":
               draw.rectangle([x, y, x + 300, y + 35], outline=(180, 180, 180), width=2)
               draw.text((x + 10, y + 8), el["text"], fill=(150, 150, 150))
           elif el["type"] == "text":
               draw.text((x, y), el["text"], fill="black")
           elif el["type"] == "link":
               draw.text((x, y), el["text"], fill=(66, 133, 244))


   return img




print("Helper functions defined successfully.")




print("n" + "=" * 70)
print("SECTION 5: Single-step inference - blank page (cold start)")
print("=" * 70)
print("The agent starts at about:blank and must decide its first action.n")


blank_image = Image.new("RGB", (1280, 720), color="white")


task = "Go to arxiv.org and find the latest paper about Molmo from Ai2"


prompt = build_prompt(
   task_description=task,
   page_url="about:blank",
   page_index=0,
)


print(f"Task: {task}")
print("Screenshot: blank white image (about:blank)")
print("Running inference...n")


raw_output = run_inference(prompt, blank_image)


print(f"Raw model output:n{raw_output}n")


parsed = parse_thought_and_action(raw_output)
print(f"Thought: {parsed['thought']}")
print(f"Action:  {parsed['action']}")


action_details = parse_action_details(parsed["action"])
print(f"Parsed:  {action_details}")

We run our first single-step inference: starting from a blank about:blank screenshot, the agent must reason about the task and propose an opening action, typically a goto() navigation. Parsing the raw completion into a thought, an action string, and a structured action dict shows the full cold-start pipeline working end to end.

print("n" + "=" * 70)
print("SECTION 6: Single-step inference - webpage screenshot")
print("=" * 70)


search_page = create_synthetic_webpage(
   title="Google",
   elements=[
       {"type": "text", "text": "Google", "pos": (560, 200)},
       {"type": "input", "text": "Search Google or type a URL", "pos": (390, 340)},
       {"type": "button", "text": "Google Search", "pos": (490, 400)},
       {"type": "button", "text": "I'm Feeling Lucky", "pos": (660, 400)},
   ]
)


task_search = "Search Google for 'MolmoWeb Ai2 open source web agent'"


prompt_search = build_prompt(
   task_description=task_search,
   page_title="Google",
   page_url="https://www.google.com",
   page_index=1,
   past_actions=[
       {
           "index": 1,
           "thought": "I need to go to Google to perform a search.",
           "action": 'goto("https://www.google.com")',
       }
   ],
)


print(f"Task: {task_search}")
print("Screenshot: synthetic Google search page")
print("Running inference...n")


raw_search = run_inference(prompt_search, search_page)


print(f"Raw model output:n{raw_search}n")


parsed_search = parse_thought_and_action(raw_search)
print(f"Thought: {parsed_search['thought']}")
print(f"Action:  {parsed_search['action']}")


visualise_click(search_page, parsed_search["action"], title="MolmoWeb -> Google Search")




print("n" + "=" * 70)
print("SECTION 7: Multi-step agent loop (simulated)")
print("=" * 70)
print("""
In production, MolmoWeb runs in a loop:
 1. Capture screenshot from browser
 2. Build prompt with task + action history
 3. Run model -> get thought + action
 4. Execute action in browser (Playwright)
 5. Repeat until send_msg() or max steps


Below we simulate 3 steps with synthetic screenshots.
""")


task_multi = "Go to the Ai2 website and find information about MolmoWeb"


print("--- Step 1: about:blank ---")
step1_img = Image.new("RGB", (1280, 720), color="white")
step1_prompt = build_prompt(task_multi, page_url="about:blank", page_index=0)
step1_raw = run_inference(step1_prompt, step1_img)
step1_parsed = parse_thought_and_action(step1_raw)
print(f"  Thought: {step1_parsed['thought']}")
print(f"  Action:  {step1_parsed['action']}")


history = [{"index": 1, "thought": step1_parsed["thought"], "action": step1_parsed["action"]}]


print("n--- Step 2: Ai2 homepage ---")
step2_img = create_synthetic_webpage(
   title="Allen Institute for AI",
   elements=[
       {"type": "text", "text": "AI for the Common Good", "pos": (50, 120)},
       {"type": "link", "text": "Open Models", "pos": (50, 180)},
       {"type": "link", "text": "Molmo", "pos": (50, 210)},
       {"type": "link", "text": "MolmoWeb", "pos": (50, 240)},
       {"type": "link", "text": "OLMo", "pos": (50, 270)},
       {"type": "link", "text": "Research", "pos": (50, 310)},
       {"type": "link", "text": "News", "pos": (50, 340)},
       {"type": "input", "text": "Search...", "pos": (800, 70)},
   ]
)


step2_prompt = build_prompt(
   task_multi,
   past_actions=history,
   page_title="Allen Institute for AI",
   page_url="https://allenai.org",
   page_index=1,
)
step2_raw = run_inference(step2_prompt, step2_img)
step2_parsed = parse_thought_and_action(step2_raw)
print(f"  Thought: {step2_parsed['thought']}")
print(f"  Action:  {step2_parsed['action']}")


visualise_click(step2_img, step2_parsed["action"], title="Step 2: Ai2 Homepage")


history.append({"index": 2, "thought": step2_parsed["thought"], "action": step2_parsed["action"]})


print("n--- Step 3: MolmoWeb blog page ---")
step3_img = create_synthetic_webpage(
   title="MolmoWeb: An open agent for automating web tasks",
   elements=[
       {"type": "text", "text": "March 24, 2026 | Ai2", "pos": (50, 110)},
       {"type": "text", "text": "Web agents that navigate and complete tasks", "pos": (50, 160)},
       {"type": "text", "text": "in a browser on your behalf.", "pos": (50, 185)},
       {"type": "link", "text": "Models on HuggingFace", "pos": (50, 240)},
       {"type": "link", "text": "Tech Report (PDF)", "pos": (50, 270)},
       {"type": "link", "text": "Training Data", "pos": (50, 300)},
       {"type": "link", "text": "GitHub Code", "pos": (50, 330)},
       {"type": "link", "text": "Live Demo", "pos": (50, 360)},
       {"type": "text", "text": "MolmoWeb-8B achieves 78.2% pass@1 on WebVoyager", "pos": (50, 420)},
       {"type": "text", "text": "94.7% pass@4 with test-time scaling", "pos": (50, 450)},
   ]
)


step3_prompt = build_prompt(
   task_multi,
   past_actions=history,
   page_title="MolmoWeb: An open agent for automating web tasks",
   page_url="https://allenai.org/blog/molmoweb",
   page_index=2,
)
step3_raw = run_inference(step3_prompt, step3_img)
step3_parsed = parse_thought_and_action(step3_raw)
print(f"  Thought: {step3_parsed['thought']}")
print(f"  Action:  {step3_parsed['action']}")


print(f"nFull action history after 3 steps:")
history.append({"index": 3, "thought": step3_parsed["thought"], "action": step3_parsed["action"]})
for a in history:
   print(f"  Step {a['index']}: {a['action']}")




print("n" + "=" * 70)
print("SECTION 8: Action parsing & routing demo")
print("=" * 70)


demo_actions = [
   'click(0.45, 0.32)',
   'goto("https://arxiv.org")',
   'type("MolmoWeb Ai2 web agent")',
   'scroll(down)',
   'press("Enter")',
   'send_msg("The latest paper is titled Molmo2.")',
   'go_back()',
   'new_tab()',
]


print("nParsing various MolmoWeb action strings:n")
for a in demo_actions:
   parsed_a = parse_action_details(a)
   print(f"  Input:  {a}")
   print(f"  Output: {parsed_a}n")

We then exercise the model on realistic inputs: a synthetic Google search page for single-step grounding, a simulated three-step browsing session with accumulated action history, and a routing demo that parses every supported action type into a structured dict. Together, these components let us simulate and analyze the agent’s behavior in a controlled environment.
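One detail worth spelling out: MolmoWeb clicks use normalised 0–1 coordinates, so executing them requires scaling by the viewport size. A quick worked example at the 1280×720 viewport used throughout this tutorial:

```python
def to_pixels(x_norm, y_norm, width=1280, height=720):
    """Convert normalised (0-1) click coordinates to integer pixels."""
    return int(x_norm * width), int(y_norm * height)

# click(0.45, 0.32) on a 1280x720 viewport:
print(to_pixels(0.45, 0.32))  # (576, 230)
```

This is the same conversion visualise_click() does for plotting and that a browser executor must do before dispatching a mouse event; keeping coordinates normalised makes the model's output independent of screenshot resolution.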

print("=" * 70)
print("SECTION 9: Batch inference on multiple tasks")
print("=" * 70)
print("Running the model on several different cold-start tasks.n")


batch_tasks = [
   "What is the weather in Seattle right now?",
   "Find the cheapest nonstop flights from NYC to London",
   "Look up the Ai2 careers page and list open positions",
   "Search Amazon for a USB-C hub with at least 4 ports",
]


blank = Image.new("RGB", (1280, 720), color="white")


for i, task_text in enumerate(batch_tasks, 1):
   prompt_b = build_prompt(task_description=task_text, page_url="about:blank")
   raw_b = run_inference(prompt_b, blank, max_new_tokens=200)
   parsed_b = parse_thought_and_action(raw_b)
   action_d = parse_action_details(parsed_b["action"])


   print(f"Task {i}: {task_text}")
   print(f"  Thought: {parsed_b['thought']}")
   print(f"  Action:  {parsed_b['action']}")
   print(f"  Parsed:  {action_d}n")




print("=" * 70)
print("SECTION 10: Exploring the MolmoWebMix training dataset")
print("=" * 70)
print("""
MolmoWebMix consists of three main subsets:
 1. MolmoWeb-HumanTrajs    - 30k human-recorded web task trajectories
 2. MolmoWeb-SyntheticTrajs - Synthetic trajectories from axtree agents
 3. MolmoWeb-SyntheticQA    - 2.2M screenshot QA pairs for visual grounding
""")


try:
   from datasets import load_dataset


   print("Loading a sample from MolmoWeb-HumanTrajs (streaming mode)...n")
   ds = load_dataset(
       "allenai/MolmoWeb-HumanTrajs",
       split="train",
       streaming=True,
   )


   print("Sample entries from MolmoWeb-HumanTrajs:n")
   for i, example in enumerate(ds):
       if i >= 3:
           break


       print(f"  Example {i + 1}:")
       keys = list(example.keys())
       print(f"    Keys: {keys}")


       for k in keys:
           val = example[k]
           if isinstance(val, str):
               display = val[:120] + ("..." if len(val) > 120 else "")
               print(f"    {k}: {display}")
           elif isinstance(val, list):
               print(f"    {k}: list of {len(val)} items")
           elif isinstance(val, dict):
               print(f"    {k}: dict with keys {list(val.keys())[:5]}")
           elif isinstance(val, (bytes, bytearray)):
               print(f"    {k}: binary data ({len(val)} bytes)")
           else:
               print(f"    {k}: {val}")
       print()


   print("Dataset exploration complete.")
   print("Full datasets: https://huggingface.co/collections/allenai/molmoweb-data")


except Exception as e:
   print(f"Could not load dataset: {e}")
   print("You can explore it at: https://huggingface.co/collections/allenai/molmoweb-data")




print("n" + "=" * 70)
print("BONUS: Full production agent loop (reference, not runnable in Colab)")
print("=" * 70)


print('''
import asyncio
from playwright.async_api import async_playwright


async def run_molmoweb_agent(task: str, max_steps: int = 15):
   """Full MolmoWeb agent loop with a live Chromium browser."""


   async with async_playwright() as pw:
       browser = await pw.chromium.launch(headless=True)
       page = await browser.new_page(viewport={"width": 1280, "height": 720})


       action_history = []


       for step in range(1, max_steps + 1):
           screenshot_bytes = await page.screenshot()
           screenshot = Image.open(BytesIO(screenshot_bytes)).convert("RGB")


           prompt = build_prompt(
               task_description=task,
               past_actions=action_history,
               page_title=await page.title(),
               page_url=page.url,
               page_index=step,
           )


           raw = run_inference(prompt, screenshot)
           parsed = parse_thought_and_action(raw)
           action = parse_action_details(parsed["action"])


           print(f"Step {step}: {parsed['thought']}")
           print(f"  -> {parsed['action']}")


           if action["type"] == "goto":
               await page.goto(action["url"], wait_until="domcontentloaded")
           elif action["type"] == "click":
               x_px = int(action["x"] * 1280)
               y_px = int(action["y"] * 720)
               await page.mouse.click(x_px, y_px)
           elif action["type"] == "type":
               await page.keyboard.type(action["text"])
           elif action["type"] == "press":
               await page.keyboard.press(action["key"])
           elif action["type"] == "scroll":
               delta = -500 if action["direction"] == "up" else 500
               await page.mouse.wheel(0, delta)
           elif action["type"] == "go_back":
               await page.go_back()
           elif action["type"] == "send_msg":
               print(f"\nAgent answer: {action['message']}")
               break


           action_history.append({
               "index": step,
               "thought": parsed["thought"],
               "action": parsed["action"],
           })


           await asyncio.sleep(1.5)


       await browser.close()
       return action_history


# Usage:
# asyncio.run(run_molmoweb_agent("Find the latest Ai2 research papers"))
''')




print("=" * 70)
print("Tutorial Complete!")
print("=" * 70)
print("""
What you learned:
 - Loading MolmoWeb-4B with 4-bit quantization on a free Colab T4
 - The structured prompt template (GOAL / PREVIOUS STEPS / ACTIVE PAGE)
 - Single-step inference on blank and real-looking screenshots
 - Multi-step agent loop with accumulated action history
 - Parsing model outputs into structured action dictionaries
 - Visualising click coordinates overlaid on screenshots
 - Batch inference across different task types
 - Exploring the MolmoWebMix training dataset
 - Production agent architecture with Playwright


Resources:
 Models:  https://huggingface.co/collections/allenai/molmoweb
 Data:    https://huggingface.co/collections/allenai/molmoweb-data
 Code:    https://github.com/allenai/molmoweb
 Paper:   https://allenai.org/papers/molmoweb
 Blog:    https://allenai.org/blog/molmoweb
 Demo:    https://molmoweb.allen.ai/
""")

We run full demonstrations including single-step inference, multi-step agent loops, batch task execution, and dataset exploration. We simulate realistic browsing scenarios, track action history, and observe how the model evolves its decisions across steps. This completes the end-to-end pipeline and gives us a clear understanding of how MolmoWeb operates as a functional web agent.

In conclusion, we built a strong practical understanding of how MolmoWeb works as a screenshot-driven web agent within a Colab-friendly Python workflow. We saw how to structure prompts, run inference on visual browser states, parse reasoning and actions, visualize predicted click locations, and simulate multi-step task execution with accumulated history. We also went beyond basic inference by exploring batch predictions, inspecting the MolmoWebMix training data, and studying a production-style browser loop that connects the model to a live Playwright session. Through this process, we not only run the model but also come to understand the full pipeline required to turn a multimodal model into a functioning web agent.



The post How to Build a Vision-Guided Web AI Agent with MolmoWeb-4B Using Multimodal Reasoning and Action Prediction appeared first on MarkTechPost.