After the initial hype around agents like Manus, I tested them on complex, real-world tasks like generating presentations. They were far from practical back then. Has that changed? It’s time for another look.
The Forms and Functions of AI Agents
AI browsers have been in the spotlight recently. Coupled with the rise of models known for their agent capabilities, like Kimi K2, GLM 4.6, and Minimax M2, this has me seriously considering the future of agents in practical applications.
Riding that trend, I’ve been thinking about the challenges agents face in the digital world. The truth is, no single model or product can handle everything perfectly yet; each task has its own requirements.
Just like chatbots, there’s no one-size-fits-all agent. It’s best to have a few different tools on hand for different problems.
(A quadrant chart here classifies agent tasks along two axes: web versus local OS environment, and standardized interfaces versus GUI interaction.)
Of these, the top-left and bottom-right quadrants are currently the most mature, as the web is decentralized while the OS is centralized.
AI browsers, Claude Code, Manus—they’re all fundamentally the same. They let an AI control a self-contained browser sandbox or local environment to handle complex, time-consuming tasks with various tools.
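Strip the branding away and the shared architecture is a model, a handful of tools, and a loop. Here’s a minimal sketch of that loop in Python, assuming an OpenAI-compatible API; the single run_shell tool and the model name are illustrative stand-ins, not any vendor’s actual internals:

```python
# The skeleton shared by AI browsers, Claude Code, and Manus-style agents:
# a model that requests tool calls, and a loop that executes them.
# Assumes an OpenAI-compatible API; the tool set and model name are
# illustrative stand-ins, not any product's actual internals.
import json
import subprocess

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the sandbox and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def agent(task: str, model: str = "gpt-4o") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        reply = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS)
        msg = reply.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:        # no tool requests left: the task is done
            return msg.content
        for call in msg.tool_calls:   # run each requested tool, feed results back
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_shell(args["command"]),
            })
```

Swap run_shell for browser actions and you get an AI browser; put the loop on a cloud VM and you get a Manus-style product. Everything else is packaging.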
Since models like Kimi, GLM, and Minimax boast impressive agent capabilities, have their official products leveraged these skills to rise above the competition from major overseas AI labs and Chinese tech giants?
A quick look confirmed it—I was just late to the game. The flagship AI products from the big overseas players and Chinese internet giants lack full agent capabilities, offering “Deep Research” at best. Strip away the image and video generation, and they’re just plain old chatbots.
But Kimi, GLM, and Minimax have integrated full-fledged agent features. Kimi has “OK Computer,” GLM (Z.ai) offers “Full-Stack,” and Minimax has its “Pro mode.”
With these agent capabilities, could they become my daily drivers for AI?
The Three Tests
I happen to keep a list of tasks I’ve previously thrown at AI, which are perfect for testing these new agent products:
- What’s the current fighter jet lineup of the Chinese Air Force? Find the main models and grab photos of each from various angles online.
- Create an illustrated presentation on the history of Earth’s geological ages, preferably in PowerPoint format.
- This is my personal website: http://victor42.eth.limo/. I want to check my personal information exposure. Scour the internet for as much of my private info as you can and see what you can find out about me.
The short answer: they’ve improved and are almost usable, but they still need human guidance and course correction every step of the way.
Test 1: Air Force Fighter Lineup
For the first test, Kimi delivered a fairly complete result. I’m no military expert, so I didn’t fact-check the data, but one look at the photos told me they were wrong. It mixed up many of the aircraft models.
Kimi’s output: https://sbudgp6km5i3s.ok.kimi.link/
I’m hesitant to even share GLM’s result. It just generated AI images of jets. After I complained several times, it tried to pull a fast one by labeling a landscape picture “real photo” and using scenic shots instead of actual aircraft photos.
Minimax was painfully slow. The other two were done with all tests before it even finished the first one. However, the page layout was clean, and its image matching was the most accurate of the three.
Minimax’s output: https://nycqzyogwce4.space.minimaxi.com/
Test 2: Geological Ages Report
For the geology presentation, I expected them to code an HTML-based slideshow. GLM does have a dedicated PPT mode, which, as far as I can tell, generates HTML and then converts it to PowerPoint. But I intentionally chose its “Full-Stack” mode to see what a general-purpose agent could do with the task.
This task didn’t require much online research, as the models’ internal knowledge was sufficient. Both Kimi and GLM handled it well. GLM produced an HTML file, not a PPT. Minimax’s agent was just too slow, so I gave up on it.
Kimi’s output: https://qvokpfxqsh.feishu.cn/file/Sdz0bwNffoAFXKxqyItc4WNenwc?from=from_copylink
GLM’s output: https://p0r7a94j92w1-deploy.space.z.ai
Same old problem: all AI-generated images.
Test 3: Personal Information Exposure
The third test could have been handled by the “Deep Research” features, but I used it to test the agent’s ability to plan and gather information comprehensively. This really tests the model’s core capabilities, not just its agent skills. I wasn’t concerned with the format, only the content.
Kimi produced a flashy-looking report, but the content was thin and the information gathering was superficial.
Kimi’s output: https://dgkenxfkgs2to.ok.kimi.link/
GLM refused to run the task both times I tried, citing security concerns.
Minimax delivered a detailed markdown file. It was clear it had independently researched various pieces of information before compiling the final report.
Minimax’s output: https://agent.minimaxi.com/share/328823906788332?chat_type=0
For comparison, here’s how a non-agent product, Grok, handled the third question: https://grok.com/share/bGVnYWN5LWNvcHk%3D_acd6451b-b37a-405e-a700-91d692edaac6
The contrast suggests that on complex tasks, agents outperform plain chatbots, even on a task like this one that doesn’t strictly require tool calling.
In fact, you could likely get similar results from the agents in Kimi, GLM, and Minimax by using their APIs with a tool like Claude Code to run tasks on your local machine. The only real difference is the environment shifts from a cloud Linux server to your own Windows or Mac.
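A minimal sketch of what that looks like, assuming Moonshot’s documented OpenAI-compatible endpoint; the base URL and model id below come from their public docs and may have changed, so treat them as assumptions:

```python
# Same harness, different brain: point an OpenAI-style client at another
# provider's compatible endpoint. The base URL and model id below are
# assumptions based on Moonshot's public docs; verify before relying on them.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.cn/v1",  # Kimi's OpenAI-compatible endpoint
)

reply = client.chat.completions.create(
    model="kimi-k2-0711-preview",  # illustrative model id; check the model list
    messages=[{"role": "user", "content": "Plan the steps to build a slideshow."}],
)
print(reply.choices[0].message.content)
```

Plug a client like this into the loop sketched earlier and the “agent product” is just your own machine doing the work.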
So, in essence, all these different types of agent products are cut from the same cloth.
Role in Non-Standardized Tasks
Looking back at the quadrant chart, my tests only covered the two right-side quadrants, which involve standardized tasks like local file operations and web requests.
With standardized tasks, you get predictable results as long as you follow the correct procedure.
Today’s agents are already quite powerful for these. If you know the right steps for a task, they can be a massive help.
But the tasks on the left side of the chart are far more ambiguous. Asking an AI to navigate a non-standard GUI on a website or local app yields unpredictable results. You never know if the task will even be completed. This area is far less mature, and we’ve yet to see a true killer app.
Even with pioneers like Dia/Comet and now Atlas, this reality hasn’t changed.
Understanding a GUI takes more than parsing HTML; it demands strong visual capability. Ideally, the AI would receive a continuous video stream of the screen, as if it were on a video call; otherwise it can take minutes just to locate a single button on a page. But the cost of providing such a feature to everyone would be astronomical.
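To make the cost concrete, here is roughly what one GUI “step” looks like today. This is a sketch, not any product’s actual pipeline; the vision model and the “x,y” reply format are assumptions I’m making for illustration. Every single click pays for a screenshot upload plus a model round-trip:

```python
# One GUI action today: screenshot -> vision model -> single click.
# Each action costs an image upload and a model round-trip, which is why
# a continuous video-stream experience would be so expensive at scale.
# The model name and the "x,y" reply format are illustrative assumptions.
import base64

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def locate(screenshot_png: bytes, target: str) -> tuple[int, int]:
    """Ask a vision model for pixel coordinates of a UI element."""
    data_url = "data:image/png;base64," + base64.b64encode(screenshot_png).decode()
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Reply with only 'x,y' pixel coordinates of: {target}"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    x, y = reply.choices[0].message.content.strip().split(",")
    return int(x), int(y)

with sync_playwright() as p:
    page = p.chromium.launch(headless=False).new_page()
    page.goto("https://example.com")
    x, y = locate(page.screenshot(), "the 'More information' link")
    page.mouse.click(x, y)  # one model round-trip per click
```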
Still, even in their current state, agents can be incredibly helpful for certain non-standardized tasks.
I’ve recently been researching vacation islands in Southeast Asia. Step one: identify the potential islands.
When it comes to travel info, I only trust sources like Xiaohongshu and Mafengwo, not the open web. So I used an agent with the Playwright MCP server: after I logged into the sites for it, it scoured them according to my instructions and gathered a ton of information. I had it expand the search twice and then run a verification pass.
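The manual-login step works because the browser the agent drives can keep a persistent profile, so cookies survive across sessions. Here is the same idea in plain Python Playwright, as a minimal sketch; the profile directory and target site are arbitrary examples:

```python
# A persistent browser profile lets a one-time manual login survive across
# later agent sessions: log in once, then let the agent browse as you.
# The profile directory and target site are just examples.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        user_data_dir="./agent-profile",  # cookies and sessions persist here
        headless=False,                   # visible, so a human can log in
    )
    page = context.new_page()
    page.goto("https://www.xiaohongshu.com")
    input("Log in manually, then press Enter to finish...")
    context.close()  # the session is saved under ./agent-profile
```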
I then double-checked the verified results with several other AI tools, and everything checked out.
Just like that, I had a solid list of potential destinations to start my planning. I then used similar methods to have the AI flesh out the details, one dimension at a time, until I could narrow it down to a single choice.
From there, I switched to my usual travel planning methodology and manually crafted the full itinerary:
A Step-by-Step Guide to Travel Planning
Postscript
At the start of the year, people were calling it the “Year of the Agent.” It turns out they weren’t exaggerating.
Agents have already borne fruit in the programming world. Their success is undeniable, and I’ve been using them heavily for a while. Now, they’re starting to prove their value in other fields too.
It’s the perfect time to shift our perspectives and start experimenting. I just hope I’m not too late to the party.
Finally, for comparison, here’s a link to a test I did a while back on AI-generated presentations. You can see just how much progress agents have made: