Transcript
CXsjgokvlJ4 • Microsoft’s Fara-7B Explained: The Tiny AI That Uses Your PC Like a Human
Kind: captions
Language: en
You've probably been waiting for AI to
actually do things for you instead of
just giving you answers. Well, you're
not alone. Most of us have been stuck
copying and pasting AI responses,
manually completing tasks that we
thought AI would handle by now. Trust
me, I felt the same frustration. But
here's what surprised me. Microsoft just
released an AI that literally uses your
computer like a person would, clicking,
typing, navigating websites, and it's
only 7 billion parameters. That's tiny
compared to GPT3's 175 billion. Yet,
it's outperforming models 100 times its
size. Welcome back to bitbiased.ai,
where we do the research so you don't
have to. Join our community of AI
enthusiasts with our free weekly
newsletter. Click the link in the
description below to subscribe. You will
get the key AI news, tools, and learning
resources to stay ahead. So, in this video, I'm going to show you exactly how Fara-7B works, why it's a game-changer for privacy and on-device AI, and what it means for the future of AI assistants that actually complete tasks for you.
By the end, you'll understand not just
what makes this model different, but how
it could transform the way you work with
AI day-to-day. And the best part, it's completely open-source and free. First up, let's talk about what makes Fara-7B fundamentally different from every chatbot you've used before.
What makes Fara-7B actually different? Here's the thing
about most AI assistants. They're
brilliant conversationalists, but they
can't actually do anything.
You ask ChatGPT to book a flight and what do you get? Instructions, steps, a polite explanation of how you should do it. Fara-7B flips that entire paradigm on its head. Imagine an AI that doesn't
just chat with you, but actually opens
your browser, navigates to websites,
fills out forms, and completes tasks
while you watch. That's exactly what Fara-7B does. Microsoft calls it a computer use agent model. And unlike traditional chatbots, Fara-7B leverages your computer's mouse and keyboard to complete tasks on your behalf.
It literally sees your screen and clicks
and types as needed to perform
multi-step tasks just like you would.
Now, before you think this requires some massive supercomputer, here's where it gets interesting. With only 7 billion parameters, Fara-7B is surprisingly compact. For context, GPT-3 had 175 billion parameters. Yet Microsoft calls Fara-7B an ultra-compact computer agent that already achieves state-of-the-art performance for its size.
And wait until you hear this. It's
completely open source and freely
available under an MIT license, so
anyone can try it out on Windows or
Linux PCs. But the real question is, how
does something this small actually work?
Let's dive into that next. How Fara-7B sees and uses your computer. Think of Fara-7B as having a pair of digital eyes and hands.
But here's what makes it unique. It
doesn't cheat by reading hidden browser
code or accessing special metadata that
regular users can't see.
Instead, it processes raw screenshots of
your browser or desktop exactly like a
human looking at the screen and then
outputs actions, predicting exactly
where to click or what keys to press.
Microsoft describes it this way: Fara-7B operates by visually perceiving a web page and taking actions such as scrolling, typing, and clicking at directly predicted coordinates. In practice, this means Fara-7B doesn't rely on any hidden browser code or accessibility metadata. It only uses the pixel image of the page, just as you do. This visual-first design gives Fara-7B what one researcher called pixel sovereignty: the AI keeps all image and reasoning data on your device.
And this next part is crucial,
especially if you work in regulated
industries like healthcare or finance
because everything stays on your
computer. It helps meet strict compliance rules like HIPAA and GLBA by keeping user data local to your machine. No screenshots sent to the cloud. No sensitive information leaving your device. Because Fara-7B sees the screen directly, it can handle complex or even obfuscated websites that might stump other approaches.
The model works in two parts. First, a
short reasoning step where it thinks
about what to do. Then a precise action
command. The available actions are basic GUI operations: move the mouse to specific coordinates and click, or type text, which mimics exactly what you would do manually. In effect, Fara-7B transforms your natural language instructions into a sequence of mouse and keyboard actions. You tell it what you want, and it figures out the steps to make it happen.
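That reason-then-act loop can be sketched in a few lines of Python. To be clear, the exact output format Fara-7B emits isn't specified here, so the `<think>` tag and the JSON action schema below are illustrative assumptions, and the GUI operations are stubbed as strings rather than real mouse and keyboard calls:

```python
import json
import re

def parse_agent_output(raw: str):
    """Split one model response into its short reasoning step and its
    precise action command. Format is an assumed example, not Fara-7B's
    real schema: reasoning in <think>...</think>, then a JSON action."""
    match = re.search(r"<think>(.*?)</think>\s*(\{.*\})", raw, re.DOTALL)
    if not match:
        raise ValueError("unrecognized agent output")
    reasoning = match.group(1).strip()
    action = json.loads(match.group(2))
    return reasoning, action

def execute(action: dict) -> str:
    """Map a parsed action to a basic GUI operation. Stubbed as strings;
    a real harness would call an automation library here instead."""
    kind = action["type"]
    if kind == "click":
        return f"click at ({action['x']}, {action['y']})"
    if kind == "type":
        return f"type {action['text']!r}"
    if kind == "scroll":
        return f"scroll {action['amount']} units"
    raise ValueError(f"unknown action: {kind}")

raw = '<think>The search box is near the top.</think> {"type": "click", "x": 412, "y": 88}'
reasoning, action = parse_agent_output(raw)
print(reasoning)        # The search box is near the top.
print(execute(action))  # click at (412, 88)
```

In a full agent loop, a fresh screenshot would be taken after each executed action and fed back to the model until the task is done.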
But here's where you might be wondering,
how do you train an AI to do this
without having thousands of people
manually demonstrating tasks?
That's the genius part and it involves
something Microsoft calls synthetic data
generation.
Training with synthetic data,
the secret sauce.
Getting real examples of people controlling a browser for hundreds of different tasks would be incredibly expensive and time-consuming.
So Microsoft got creative. They built a synthetic data generation pipeline called FaraGen, and it's basically a team of AI agents that invent tasks and then solve them to create training examples. Here's how it works. First, the system proposes thousands of realistic tasks by seeding prompts with real website URLs.
For example, book two tickets to Wicked,
the movie on Fandango, or find a blue
hoodie under $50 on Amazon.
These are real world tasks you or I
might actually do. Then comes the clever
part. A pair of bot agents, an
orchestrator and a web surfer, actually
go through the steps of completing each
task on the live web. They simulate
clicks, form fills, searches,
everything. They're basically learning
by doing, just like a human would.
Finally, other verifier bots review the
screen recordings to make sure the task
was done correctly, discarding any
failures. The result, a massive set of
verified computer interaction
trajectories, sequences of screenshots,
actions, and reasoning steps that solve
each task. Using this method, Microsoft
generated about 145,000 task
trajectories covering over 1 million
individual steps, spanning many kinds of
web activities like shopping, booking,
travel, filling forms, and searching for
information.
All this data comes from real websites and plausible user prompts. Then they distilled that complex multi-agent process into one single model, Fara-7B, through supervised fine-tuning. In other words, Fara-7B learned to mimic the successful example trajectories from the pipeline. The key idea is this: a small model can learn to act like a multi-agent system without needing all those extra agents at runtime. And this approach is working
better than anyone expected.
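The three-stage flow described above, propose, solve, verify, can be sketched with stub agents. Everything in this snippet (the templates, the fixed step list, the fake rule that "hoodie" tasks fail) is placeholder logic standing in for the real LLM agents in the pipeline:

```python
def propose_tasks(seed_urls):
    """Stage 1: invent realistic tasks by seeding prompts with real site
    URLs. The real pipeline uses LLM prompting; this stub fills templates."""
    templates = ["book two movie tickets on {}",
                 "find a blue hoodie under $50 on {}"]
    return [t.format(url) for url in seed_urls for t in templates]

def solve(task):
    """Stage 2: an orchestrator + web-surfer agent pair attempts the task,
    recording every step. Stubbed: fixed steps, and 'hoodie' tasks are
    pretended to fail so the filter below has something to discard."""
    return {"task": task,
            "steps": ["navigate", "click", "type", "click"],
            "completed": "hoodie" not in task}

def verify(trajectory):
    """Stage 3: verifier agents review the recording; failures are dropped."""
    return trajectory["completed"]

tasks = propose_tasks(["fandango.com", "amazon.com"])
dataset = [traj for traj in map(solve, tasks) if verify(traj)]
print(len(tasks), len(dataset))  # 4 tasks proposed, 2 verified trajectories kept
```

The surviving trajectories (screenshots, actions, reasoning) become the supervised fine-tuning data for the single small model.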
But to understand why, we need to look
under the hood at how this model is
actually built. Model architecture and why size doesn't always matter. At its core, Fara-7B is built on a vision-language transformer based on the Qwen2.5-VL-7B model. This is a 7 billion parameter multimodal language model with strong visual grounding capabilities. What that means in practical terms is that it can take an image (your screenshot) plus text (your instruction) and output actions. The overall design is elegantly simple: pixels in, actions out. The only inputs are the latest screen images and your task description, and the output is a reasoning step plus a tool command. It's like having one unified brain that sees the page and decides the next mouse or keyboard move.
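A minimal sketch of the "pixels in" side looks like this: the request carries only the encoded screenshot and the task text, with no DOM or accessibility data attached. The field names here are assumptions for illustration, not Fara-7B's actual input schema:

```python
import base64

def build_request(screenshot_png: bytes, task: str, history: list):
    """Assemble the only inputs the model sees: the latest screenshot and
    the natural-language task (plus prior actions for context). Hypothetical
    field names; the point is what's absent: no DOM, no accessibility tree."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "images": [image_b64],        # raw pixels only
        "task": task,
        "previous_actions": history,  # e.g. ["click(412, 88)"]
    }

req = build_request(b"\x89PNG...", "Book two tickets to Wicked on Fandango", [])
print(sorted(req.keys()))  # ['images', 'previous_actions', 'task']
```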
Now, here's where the compact size becomes a massive advantage. Because Fara-7B is relatively small at 7 billion parameters, it can run locally on even modest hardware. Microsoft has a version optimized to run on PCs with AI hardware, like Copilot+ PCs, and it can even run under WSL 2 or similar environments on standard machines.
Running on device has two real benefits
that matter in the real world. First,
lower latency. No network delay waiting
for cloud responses.
Second, and this is huge, much stronger
privacy because your sensitive screen
data never leaves your machine. Every
screenshot, every task, every piece of
information stays completely local.
But you're probably thinking, if it's so
small, how well does it actually perform
compared to the big models?
Well, this next part might surprise you.
Benchmarks. When David beats Goliath. Fara-7B has been tested on standard web navigation benchmarks, and the results are honestly shocking. On the WebVoyager benchmark, which is a common test for web agents, Fara-7B achieved about a 73.5% task success rate. Now, let me put that in perspective for you. GPT-4o, OpenAI's vision-capable model that's significantly larger and runs in the cloud, reached about 65.1% on the same test. Another comparable 7 billion parameter agent, called UI-TARS-1.5-7B, scored around 66.4%. In other words, Fara-7B is beating models that are orders of magnitude larger. It's genuinely state-of-the-art for its size class.
But here's where it gets even better. Not only does Fara-7B succeed more often, it uses fewer steps to finish tasks. In testing, Fara-7B averaged only about 16 steps per task versus 41 steps for a similar 7 billion parameter agent. Fewer steps generally means faster execution and lower computational cost, which translates to better user experience and efficiency. Microsoft actually plotted Fara-7B on a graph of accuracy versus computational cost, and it sits on what they call a new Pareto frontier. That's a fancy way of saying it offers an optimal balance of performance and efficiency that other models can't match at this size. The point is that despite being tiny by modern language model standards, Fara-7B is exceptionally capable for agentic tasks. It breaks ground on a new frontier, showing that on-device agents are approaching the capabilities of massive frontier models. You don't have
to use a huge cloud AI to get competent
browser automation. A well-trained small
model can actually do better in many
cases. And when you compare it directly
to other models in the field, the
differences become even clearer. Fara-7B versus the competition. Fara-7B isn't the only research effort toward AI agents. Obviously, OpenAI has GPT-4o, which is vision-enabled and can be prompted to act on a browser. Anthropic has computer use capabilities in Claude.
But in head-to-head comparisons, Fara-7B is holding its own against these giants. It outperformed GPT-4o on the benchmarks we just discussed. It even beat GPT-4o on a new test called WebTailBench, which is a collection of more complex real-world tasks. UI-TARS and other competitors fell behind as well. The big difference: GPT-4o runs in the cloud and requires significantly more computational resources. Fara-7B's strength is doing almost as well or better with a model that can actually live on your PC.
It's also completely open and free to
use, whereas the bigger models require
subscription APIs and send your data to
remote servers. This speaks to
Microsoft's broader strategy, which is
worth understanding if you want to see
where AI assistants are headed.
Microsoft's strategic play. Why did Microsoft build Fara-7B? It fits into a broader strategic push that's been building since 2024. Microsoft started rolling out small language models like the Phi family and embedded AI into Windows PCs with the Copilot+ PC initiative. Fara-7B is their first agentic small language model, meaning it can take actions, not just chat. A key goal here is on-device AI for enterprises. By keeping the model local,
businesses can automate workflows like
booking travel, managing accounts, or
filling forms without sending sensitive
data to the cloud. This addresses one of
the biggest barriers to corporate AI
adoption, data security and compliance.
As one Microsoft researcher explained, processing all visual input on device creates true pixel sovereignty, which helps in regulated fields that have to comply with HIPAA, GLBA, and other strict data protection requirements. User data remains local, improving both privacy and compliance. Microsoft also sees Fara-7B as a building block for future innovation. By open-sourcing it,
they're encouraging a community to test,
fix, and extend Agentic capabilities. It
complements their broader AI ecosystem
vision, linking models, tools, and
platforms like Azure Foundry and
Magentic UI into one cohesive system.
There's also a strategic independence
angle here. Farah 7B came shortly after
Microsoft and Open AAI redefined their
partnership, giving Microsoft more
freedom to pursue AI research
independently.
This is Microsoft's way of reducing
reliance on OpenAI's cloud
infrastructure by developing their own
capable agents. It's a step towards
self-sufficiency in the AI race. But
strategy aside, what can you actually do
with this technology right now? Real
world applications you can use today.
The team demonstrated a range of practical examples that show Fara-7B's versatility. In demo videos, it
successfully went shopping for an Xbox
controller, booked movie tickets on a
cinema site, summarized issues from a
website, and even used map and search
tools to plan a trip.
These aren't cherry-picked simple tasks. They're real workflows. In everyday terms, Fara-7B can handle tasks like
searching the web for specific
information, filling out forms, booking
travel or events, comparing product
prices across websites, and managing
online accounts. Imagine telling it,
"Find a blue t-shirt with over 500
reviews and add it to my cart, or book
two roundtrip flights from New York to
LA in March." And it would navigate the
appropriate sites to get it done.
Because it interacts with websites just like a human would, Fara-7B could automate mundane office tasks that eat up your time. Imagine it checking your company's intranet, extracting data from reports, or processing information by itself while you focus on higher-value work.
It could serve as a personal digital
assistant for the web, handling
repetitive tasks that don't require your
direct attention.
And since it runs locally, you could
even use it for sensitive tasks that you
wouldn't want to paste into a public
chatbot.
Financial research, confidential
document handling, or internal business
workflows, all done on your device
without ever touching the cloud.
But with great power comes great
responsibility. And Microsoft is well
aware of the risks involved in giving an
AI control of your computer.
Safety and ethics. The critical point
system. An AI that can click around your
computer autonomously raises legitimate
safety questions. What if it makes a
mistake? What if it accesses something
it shouldn't? Microsoft built several
important safeguards to address these
concerns. First, user control is central
to the design. Fara-7B is meant to run
in a sandbox environment with all its
actions fully logged. Every single
action, every click, every keystroke can
be audited by you in real time. You can
intervene at any moment to stop it or
change course.
Second, and this is particularly clever,
Fara-7B was trained to recognize what
Microsoft calls critical points. These
are moments where the next action would
expose personal or sensitive data like
entering an email address, confirming a
purchase, or sending a message.
At a critical point, Fara-7B is designed to pause and ask for your permission before proceeding. For example, if it's about to fill in your credit card information or send an email on your behalf, it will stop and say, "I need your approval before I do that." This pause-for-consent mechanism helps prevent runaway actions or privacy leaks. It's like having a safety net built into the decision-making process.
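A toy version of that pause-for-consent gate might look like the sketch below. Keep in mind that Fara-7B learns to recognize critical points during training rather than from a keyword list, so the patterns here are purely illustrative:

```python
# Illustrative only: the real model classifies critical points itself.
SENSITIVE_PATTERNS = ("password", "credit card", "cvv",
                      "send email", "confirm purchase", "email address")

def is_critical_point(action_description: str) -> bool:
    """Flag actions that would expose personal data or commit to
    something irreversible (a purchase, an outgoing message)."""
    desc = action_description.lower()
    return any(p in desc for p in SENSITIVE_PATTERNS)

def step(action_description: str, user_approves=None) -> str:
    """Execute an action, but pause at critical points until the user
    explicitly approves; a denial cancels the action."""
    if is_critical_point(action_description):
        if user_approves is None:
            return "PAUSED: I need your approval before I do that."
        if not user_approves:
            return "CANCELLED by user."
    return f"EXECUTED: {action_description}"

print(step("click the search button"))               # EXECUTED: click the search button
print(step("enter credit card number"))              # PAUSED: I need your approval before I do that.
print(step("confirm purchase", user_approves=True))  # EXECUTED: confirm purchase
```

Routine clicks flow through untouched; only the sensitive steps stop and wait, which is what keeps the agent usable while still keeping you in the loop.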
Third, Microsoft used red teaming and
curated data during training to steer
the model away from harmful tasks. They
mixed in refusal tasks so that Fara-7B
learned to say no to illegal or
dangerous requests. In testing, it
refused 82% of undesirable prompts in a
red team evaluation set. The team is
transparent about limitations, though.
They admit the model isn't perfect. It
can still hallucinate or misinterpret
complex instructions like any AI model.
They caution that Fara-7B is
experimental and recommend running it
only in controlled environments, not on
your main personal or financial
accounts, at least not yet. Finally,
privacy is improved by design. Unlike
some agents that pull extra hidden data
from your browser, like the
accessibility tree, Fara-7B only uses
the visible screen. No additional site
data is accessed. It interacts with the
computer the same way a human would,
relying solely on what's visible on the
screen. This keeps the model's view
limited and simpler to audit. So where
does all this lead us? The future. What
comes next? Fara-7B is just the first
step, not a finished product. Microsoft
plans to iterate on it, making it
smarter and safer over time.
They've mentioned possibilities like
adding live reinforcement learning so
the agent can learn from trying tasks
interactively in a sandbox environment.
They're working on improving instruction
following and refining safety checks
even further. The goal isn't to make a
bigger model, but a smarter and safer
one. Microsoft is committed to keeping
models small enough to run on devices
while continuously improving their
capabilities.
This is a fundamentally different
approach from the bigger is always
better philosophy that dominated AI
development for years.
In the near future, you might see Fara-based assistants built into Windows apps
or enterprise tools quietly checking
your email, summarizing reports, or
booking meetings with just a simple
command, all under your supervision.
The model's release on Hugging Face and Azure AI Foundry invites developers worldwide to experiment. If someone creates a breakthrough application or discovers a better training method, the open-weight model means that innovation can spread quickly across the community.
We're entering a new chapter in AI assistants: not chatbots that tell you how to do things, but agents that actually do them for you. Fara-7B is proof that you don't need massive models to achieve this.
A well-designed, carefully trained,
compact model can outperform giants on
practical tasks while running entirely
on your device. Final thoughts.
Microsoft's Fara-7B represents a fundamental shift in how we think about AI assistants. It's a proof of concept that a tiny 7 billion parameter model can handle big agentic tasks, outperform
much larger AIs on web navigation, run
locally for privacy, and open up
entirely new possibilities for how we
interact with technology. It may sound
like science fiction, an AI that
autonomously uses your computer, seeing
and clicking just like you would. But
according to the research and early
testing, it's becoming reality faster
than most people expected.
As with any powerful new tool, it will
need careful handling, ongoing safety
improvements, and responsible
deployment. But it points the way toward
a future where AI genuinely helps us by
doing things in the digital world, not
just talking about them. And that future
might be closer than you think. If you
found this breakdown valuable and want
to see more deep dives into emerging AI
technologies, let me know in the
comments what you'd like covered next.
Are you excited about on-device AI agents
or do the safety concerns worry you
more?
I'd love to hear your perspective.
Thanks for watching and I'll see you in
the next one.