For a few years now, we’ve been using AI to analyze meeting notes at Hyperflow. Every call gets recorded. Every recording gets transcribed. Every transcript gets fed to AI for summaries, action items, and follow-ups. That pipeline has been running in the background for a while, and it’s been good. Useful. Reliable.
But a few weeks ago, right around the time I moved everything over to OpenClaw, I started noticing the gaps.
A client would share their screen and walk through a dashboard. They’d point at a chart and say, “this number right here, that’s what we need to fix.” The transcript would capture the words. What it wouldn’t capture: which number. Which chart. The thing they were literally pointing at on screen.
Or we’d be in a design review. Someone would pull up a mockup and say, “I don’t love the spacing on this section.” The transcript gives me the words. But the words without the visual? Useless. I’d have to go back, re-watch the recording, find the moment, screenshot it, then manually connect it to what was said.
That’s the kind of work that doesn’t feel like work. It feels like being thorough. But it’s the same thing every time: re-watch, find, screenshot, connect. Over and over, for every call with a visual component, and at Hyperflow that’s most of them.
So I added one thing to the pipeline. And it’s changing the game.
The Addition
I gave my AI eyes.
Instead of only reading the transcript, OpenClaw now pulls frames from the video recording. Not every frame. Key frames: moments where the screen changes significantly, where someone shares their screen, where a new document or mockup or dashboard appears.
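The key-frame selection can be sketched with simple frame differencing. This is an illustrative version, not OpenClaw's actual logic: it assumes frames have already been decoded into grayscale numpy arrays (for example via OpenCV's VideoCapture, sampling around one frame per second), and the threshold is a made-up starting point you'd tune.

```python
import numpy as np

def select_key_frames(frames, timestamps, threshold=25.0):
    """Keep frames that differ sharply from the previous kept frame.

    frames: list of grayscale numpy arrays, all the same shape
    timestamps: seconds into the call for each frame
    threshold: mean absolute pixel difference that counts as a scene change
    Returns a list of (timestamp, frame) pairs.
    """
    if not frames:
        return []
    key = [(timestamps[0], frames[0])]       # always keep the first frame
    last = frames[0].astype(np.int16)
    for ts, frame in zip(timestamps[1:], frames[1:]):
        cur = frame.astype(np.int16)
        diff = np.abs(cur - last).mean()     # how much the screen changed
        if diff > threshold:                 # big jump = new slide or screen share
            key.append((ts, frame))
            last = cur
    return key
```

A steady talking-head feed produces near-zero differences and gets skipped; a new dashboard or mockup appearing spikes the difference and gets kept.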
Then it analyzes those frames with vision models and ties them back to the transcript. It knows what was on screen when someone said what they said. And from that, it generates visual to-dos: annotated screenshots paired with the specific action items that came out of that moment.
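Tying a frame back to the transcript is mostly a timestamp join. A minimal sketch, assuming the transcript comes as timed segments (Whisper-style dicts with start, end, speaker, and text, which is an assumption about the format, not a guarantee):

```python
def words_near_frame(frame_ts, segments, window=10.0):
    """Collect transcript segments spoken within `window` seconds of a frame.

    segments: list of dicts like {"start": 12.0, "end": 15.5,
                                  "speaker": "Sarah", "text": "..."}
    Returns the speaker-attributed lines to attach to that screenshot.
    """
    nearby = [
        s for s in segments
        if s["start"] <= frame_ts + window and s["end"] >= frame_ts - window
    ]
    return [f'{s["speaker"]}: {s["text"]}' for s in nearby]
```

The vision model then gets each key frame plus only the words spoken around it, which is what lets the output say who was pointing at what.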
The difference between “fix the spacing issue Sarah mentioned” and a screenshot of the exact section with an annotation saying “Sarah: reduce padding between header and chart, feels cramped” is night and day. One requires me to remember context. The other gives me the context.
The Pipeline
[Diagram: how it works, roughly sketched]
The Setup (It Was Already Halfway There)
Here’s why this was easier than it sounds: the infrastructure was already in place.
Every call at Hyperflow runs through a shared Google Drive folder. This folder already gets populated automatically with two things after every call: the full video recording and the transcript. That’s been our system for months. Nothing new there.
So OpenClaw was already watching that folder. It was already picking up transcripts and processing them. The addition was telling it to also grab the video file, extract key frames, and run them through a vision model before generating the summary.
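The folder-watching side can be as simple as pairing each video with its transcript by filename. A sketch, assuming the Drive folder is synced locally and that recordings and transcripts share a filename stem (`call.mp4` / `call.txt`); both the extensions and the naming scheme are assumptions about the setup, not OpenClaw internals:

```python
from pathlib import Path

def find_unprocessed_calls(folder, done):
    """Pair each recording with its transcript by shared filename stem.

    folder: local sync of the shared Drive folder
    done: set of stems already processed
    Returns [(video_path, transcript_path), ...] for new, complete calls.
    """
    folder = Path(folder)
    pairs = []
    for video in sorted(folder.glob("*.mp4")):
        transcript = video.with_suffix(".txt")
        # only process once both halves of the call have landed
        if video.stem not in done and transcript.exists():
            pairs.append((video, transcript))
    return pairs
```

Run on a timer, this is the whole "watching the folder" step: anything with both halves present and not yet processed goes into the pipeline.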
The source material was sitting right there, untouched. I was feeding my AI the text version of a video call and wondering why it missed the visual parts. In hindsight, that’s like giving someone a phone transcript of a movie and asking them to describe the cinematography.
Two Models, Head to Head
Right now I’m testing this with two different vision setups, running side by side.
Opus 4.6 handles the full pipeline on one path. It reads the transcript, analyzes the extracted frames, and generates the combined output. It’s good at understanding context across a long call and connecting frames to the right parts of the conversation. The summaries feel cohesive. It doesn’t lose the thread.
OpenAI’s Vision models run the same pipeline on a parallel path. Same frames, same transcript, same prompt structure. Different model doing the analysis.
I’m not ready to declare a winner. Both produce useful output. Opus tends to be better at the narrative connections (understanding why something was said in context). OpenAI’s vision is strong on the raw image analysis (identifying UI elements, reading text from screenshots). The ideal might end up being a combination: one model for frame analysis, the other for synthesis.
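Running the two paths side by side is easy to sketch with a thread pool. The analyzer callables here are stand-ins for the real model calls (Opus on one path, OpenAI vision on the other); the point is only that both get identical frames, transcript, and prompt:

```python
from concurrent.futures import ThreadPoolExecutor

def run_head_to_head(frames, transcript, analyzers):
    """Run each vision analyzer on the same inputs in parallel.

    analyzers: dict of name -> callable(frames, transcript) -> summary.
    Returns {name: summary} so the outputs can be compared directly.
    """
    with ThreadPoolExecutor(max_workers=len(analyzers)) as pool:
        futures = {name: pool.submit(fn, frames, transcript)
                   for name, fn in analyzers.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because both paths see identical inputs, any difference in the output is the model, not the pipeline, which is what makes the comparison fair.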
The point isn’t which model wins. The point is that both of them produce dramatically better meeting follow-ups than transcript-only analysis. The vision layer is the unlock. The specific model is a tuning decision.
What the Output Looks Like
After a call ends, here’s what I get within about 15 minutes:
Hyperflow Weekly Sync
"This conversion rate on the pricing page is way too low. We need to rethink the layout above the fold."
"The onboarding flow drops off right here at step 3. People aren't completing the profile section."
Each moment from the call gets its own card. The annotated screenshot shows exactly what was on screen. The speaker’s words are tied directly to what they were looking at. And the to-dos aren’t vague. They’re specific, visual, and attributed.
Three months from now when someone asks “who requested that change?” I have the receipt. Not a note I typed. A screenshot of what they were pointing at, with their exact words attached.
Why This Matters More Than It Sounds
Here’s the thing about meeting follow-ups: everyone does them, and almost everyone does them badly.
You leave a call. You have a vague list of “things we discussed.” Maybe you typed some notes. Maybe your AI transcription tool gave you bullet points. But the connection between what was said and what was shown is gone the moment the call ends. It lives in your memory, and memory is unreliable.
I used to compensate for this by taking detailed notes during calls. Which meant I was half-present in the meeting. I was there, but I was also documenting, screenshotting, and organizing instead of listening.
Now I’m fully in the call. I don’t take notes. I don’t screenshot anything. I listen, I contribute, I pay attention. And when it’s over, OpenClaw hands me a visual record that’s more thorough than anything I could have produced manually. Because it saw everything I saw, and it didn’t get distracted.
That’s the actual shift. Not “AI does my meeting notes.” That’s been possible for a year. The shift is: AI sees what happened in the meeting the same way I do. Visually. In context. With the full picture.
Give Your AI Eyes
If you’re running any kind of AI meeting analysis, even a basic transcript summary, consider what you’re not feeding it. If your calls involve screen shares, demos, design reviews, dashboard walkthroughs, or anything visual, you’re giving your AI a partial picture and expecting a complete analysis.
The video is already there. Most recording tools save it automatically. Most of us ignore it after the call ends. But that video contains information that the transcript doesn’t. And vision models are now good enough to extract it.
You don’t need my exact setup to start. You need a recording, a vision-capable model, and the willingness to experiment. Extract some frames from your last call. Feed them alongside the transcript. See what comes back.
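Feeding frames alongside the transcript is one multimodal request. A minimal sketch of building that request: the content shape below follows OpenAI's chat-completions image convention (base64 data URLs); other providers use a similar text/image block structure, and the prompt wording is just a starting point:

```python
import base64

def build_vision_prompt(transcript, frame_pngs):
    """Build a chat-style multimodal message: transcript text plus key frames.

    frame_pngs: list of raw PNG bytes for the extracted key frames.
    Returns a messages list ready to send to a vision-capable chat model.
    """
    content = [{"type": "text",
                "text": f"Transcript:\n{transcript}\n\n"
                        "For each frame, list the visual action items "
                        "and who asked for them."}]
    for png in frame_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```

That's the whole experiment: a handful of frames, the transcript, one request, and you can judge the output against your transcript-only summaries.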
I started this as a small experiment three weeks ago. Now it’s the part of my workflow I’d fight hardest to keep.
Transcripts gave me words. Vision gave me context. The combination gave me meetings I don’t have to re-watch. I’ll take that trade every time.