Browser automation for AI agents with Browser Use CLI

Carlos Garavito · 4 min read
ai-agents · browser-automation · openclaw · tools

The problem: my agent couldn't see the real web

My agent can already do a lot of things: search the web, read articles, run commands, manage files. But there was one clear limitation: when I asked it to check something on X, LinkedIn, or GitHub, it could only fetch plain text via web_fetch. No authentication, no interaction, no seeing what I actually see.

What I needed was for the agent to connect to my real browser — with my active sessions — and navigate pages the same way I would: clicking buttons, filling forms, taking screenshots.

Discovering Browser Use CLI

While looking for browser automation tools built for AI agents, I found Browser Use CLI. It's not plain Selenium or Playwright — it's a tool designed specifically to be consumed by LLMs.

The key differentiator: the browser-use state command returns page elements with numeric indices. Instead of raw HTML, you get something like this:

[0] <button> Follow
[1] <a href="/home"> Home
[2] <input placeholder="Search...">
[3] <div class="tweet-text"> Post content...

That's exactly what an LLM needs to reason about a page and decide what to do next. No HTML parsing, no CSS selectors. Just indices.
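Because the state output is plain text with a fixed shape, even a shell one-liner can pick an element out of it. A minimal sketch, assuming the `[N] <tag> text` shape shown above (`find_index` is my own helper name, not part of the CLI):

```shell
# find_index: read browser-use-style indexed state on stdin and print
# the index of the first element whose line matches a pattern.
# Sketch only, assuming the "[N] <tag> text" shape shown above.
find_index() {
  awk -v pat="$1" '$0 ~ pat { gsub(/[][]/, "", $1); print $1; exit }'
}

# Example against the sample output above:
printf '%s\n' \
  '[0] <button> Follow' \
  '[1] <a href="/home"> Home' \
  '[2] <input placeholder="Search...">' |
find_index 'Follow'   # prints 0
```

An agent doesn't need this, of course; the point is that the format is trivial to consume even without an LLM in the loop.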

Another important detail: it runs as a persistent daemon. The browser stays alive between commands, with ~50ms latency. No launching Chrome from scratch on every operation.

Installation: one command

curl -fsSL https://browser-use.com/cli/install.sh | bash

That's it. The script sets up a Python virtualenv at ~/.browser-use-env/, downloads Chromium via Playwright, and makes the browser-use binary available under ~/.browser-use/.

You can verify it works with:

browser-use --version

The Brave problem (and how to solve it with CDP)

This is where the first real obstacle showed up. Browser Use has a --profile flag that lets you use an existing Chrome profile. The problem: it only works with Chrome. I use Brave.

The solution was to use the Chrome DevTools Protocol (CDP). It's the same protocol used by the browser's developer tools panel (F12), Playwright, and Puppeteer. When you launch Brave with --remote-debugging-port=9222, it creates a local server that exposes full browser control.

# Close Brave if it's open, then relaunch with CDP enabled
open -a "Brave Browser" --args --remote-debugging-port=9222
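Before pointing the agent at the port, it's worth confirming the endpoint is actually up. CDP exposes a standard HTTP endpoint at `/json/version`; this small helper polls it with a timeout (`wait_for_cdp` is a name I made up for this sketch):

```shell
# wait_for_cdp: poll the DevTools /json/version endpoint until it
# answers, or give up after N attempts spaced 0.5 s apart.
wait_for_cdp() {
  local port="${1:-9222}" tries="${2:-20}" i=0
  until curl -s "http://localhost:${port}/json/version" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      echo "CDP not reachable on port ${port}" >&2
      return 1
    fi
    sleep 0.5
  done
  echo "CDP ready on port ${port}"
}
```

Run `wait_for_cdp` right after relaunching Brave and it will block until the port answers or the retries run out.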

Then connecting the agent is as simple as:

browser-use --cdp-url http://localhost:9222

To avoid remembering these steps every time, I wrote a brave-connect.sh script:

#!/bin/bash
# Cleanly quit Brave
osascript -e 'quit app "Brave Browser"'
sleep 1
 
# Relaunch with CDP enabled
open -a "Brave Browser" --args \
  --remote-debugging-port=9222 \
  --no-first-run
 
# Wait for the port to be ready
echo "Waiting for Brave..."
until curl -s http://localhost:9222/json/version > /dev/null 2>&1; do
  sleep 0.5
done
echo "Brave ready on CDP port 9222"

A note on security

Port 9222 is only accessible from localhost, so it's not exposed to the network. But there's a known attack vector: DNS rebinding. A malicious page could try to redirect requests to localhost:9222 and gain control of the browser.

How likely is this in everyday use? Low, but not zero. My approach: I use it during active work sessions and don't leave CDP enabled permanently. The browser-use daemon isn't always running either — it starts when I need it, and I disable CDP when I'm done.

If you work with highly sensitive data or in a corporate environment, think carefully about whether this trade-off makes sense for you.

The real test: reading X with my active sessions

With Brave running in CDP mode, I connected the agent and asked it to open X:

browser-use --cdp-url http://localhost:9222
browser-use open https://x.com
browser-use screenshot

Instead of the login page, my full feed appeared. The agent could see exactly what I would see. I asked it to read the most recent posts:

browser-use state

And there they were: posts from Brad Groux about the OpenClaw Foundation, Warp's support for the kitty keyboard protocol, Browserbase's new CLI. Then I navigated to a specific post from @anibal about dynamic skill injection in Claude Code — something I wanted to read in more detail.

The full flow was:

# 1. Open a page
browser-use open https://x.com/@anibal/status/...
 
# 2. Get page state (indexed elements)
browser-use state
 
# 3. Click something if needed
browser-use click 4
 
# 4. Verify the result
browser-use screenshot

Clean, predictable, and consumable by any LLM.
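The four steps compose naturally into a tiny wrapper. This is only a sketch over the commands already shown (`read_page` is my own name, and it assumes the daemon is already connected via `--cdp-url` as above):

```shell
# read_page: open a URL, dump the indexed element state, and take a
# screenshot to verify the result. Sketch over the commands shown above;
# assumes browser-use is installed and connected to the browser.
read_page() {
  local url="$1"
  if ! command -v browser-use > /dev/null 2>&1; then
    echo "browser-use is not installed" >&2
    return 127
  fi
  browser-use open "$url" &&
    browser-use state &&
    browser-use screenshot
}
```

From there, the agent (or you) can feed the state output back into a `browser-use click <index>` decision.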

Packaging it as an OpenClaw skill

With the flow validated, the next step was wrapping it as a proper OpenClaw skill. The structure ended up like this:

~/.openclaw/workspace/skills/browser-use/
├── SKILL.md              # Concise instructions for the agent
├── references/
│   └── commands.md       # Full reference for all commands
└── scripts/
    └── brave-connect.sh  # Helper to connect Brave via CDP

The SKILL.md is deliberately short: it explains the main workflow (connect via CDP, open URL, use state to navigate), establishes Brave as the primary method, and points to references/commands.md for full detail.

The progressive disclosure pattern was the key design decision here. The agent doesn't need to read 200 lines of documentation to complete its task. With the essential workflow in SKILL.md and the full reference kept separate, context is only used when it's genuinely needed.

The result

My agent can now:

  • Connect to my Brave with my active sessions
  • Navigate any authenticated site: X, LinkedIn, GitHub, Gmail
  • Read content from feeds and protected pages
  • Click, fill forms, take screenshots
  • All through simple CLI commands

What started as "I want the agent to read my tweets" turned into full access to the authenticated web. The tool was there, the protocol was there — it just needed to be wired up correctly.

If you're using OpenClaw or any agent that supports CLI tools, Browser Use is well worth exploring. The approach of returning indexed elements instead of raw HTML is exactly the kind of interface LLMs need to reason about the browser reliably.