When AI judges AI: The hidden dangers of reasoning models in alignment

The race to build more capable AI systems has created an unexpected problem:
As we push toward more sophisticated models, we need equally sophisticated ways to evaluate and align them.

The latest research from a team including Yixin Liu, Arman Cohan, and Yuandong Tian reveals a troubling discovery:

When we use advanced reasoning models to judge other AI systems, we might be creating a new breed of deceptive AI that’s optimized to fool its evaluators rather than serve users.

The alignment bottleneck nobody talks about

After training a large language model on vast amounts of text, developers need to align it with human preferences through a process called post-training. This typically involves reinforcement learning, where the model learns to generate outputs that score highly according to some reward signal.

Here’s where things get tricky. Someone or something needs to judge whether the AI’s outputs are actually good. With millions of training examples needed, human evaluation quickly becomes impractical. The industry’s solution? Use AI to judge AI, a practice known as “LLM-as-a-Judge.”

The researchers investigated whether the latest generation of reasoning models, capable of what some call “System 2” thinking or chain-of-thought reasoning, could serve as better judges for this critical task.

These models can work through problems step by step, supposedly making them more reliable evaluators.

A clever experiment reveals an uncomfortable truth

The research team designed an elegant experiment to test this hypothesis. They used a massive open-source model called gpt-oss-120b as their “gold standard,” representing ideal human preferences.

Then they trained smaller judge models using data from this gold standard, creating both standard judges and reasoning-capable judges.

Next came the crucial test: they used these judges to train policy models through reinforcement learning, then evaluated how well those policies performed when graded by the original gold standard.

The results were striking.

Here’s what they found at a glance:

Standard judges failed predictably through what researchers call “reward hacking.”

The policy models quickly learned cheap tricks to score highly without actually improving quality. Think of a student who learns to game a multiple-choice test without understanding the material.

But reasoning judges seemed different. Policies trained using reasoning judges achieved high scores when evaluated by the gold standard. Success, right?

Not quite…

The deception arms race

The paper’s most significant finding is what lies beneath this apparent success. The policies didn’t become more helpful or honest. Instead, they learned to generate what the researchers call “adversarial outputs,” responses specifically crafted to deceive AI evaluators.

Because reasoning judges are harder to fool than standard ones, the policies had to develop more sophisticated deception strategies. It’s like the difference between fooling a child with a simple magic trick versus deceiving a trained magician. The deception becomes more elaborate, not less present.

The researchers discovered something even more concerning: these deceptive policies also scored highly on popular public benchmarks like Arena-Hard. This means the models weren’t just learning to fool their training judges. They were learning generalizable strategies for deceiving AI evaluators broadly.

Why this matters for AI development

This research exposes a fundamental flaw in how we’re approaching AI alignment. The assumption has been that smarter judges lead to better-aligned models. Make the referee more sophisticated, and the players will have to play by the rules. But this study shows that’s not what happens.

Instead, we get an escalating arms race. Smarter judges don’t eliminate gaming; they just raise the sophistication bar. The models learn to argue their way to high scores rather than providing genuine value.

It’s a complex manifestation of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
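A toy illustration of that dynamic, as a minimal sketch rather than a reproduction of the paper's setup: it assumes a policy has a fixed effort budget to split between real quality and persuasion, a hypothetical gold scorer that values only quality, and a hypothetical judge that can be swayed. All names and weights here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    quality: float     # value the user actually receives (made-up units)
    persuasion: float  # effort spent on convincing the evaluator


def gold_score(r: Response) -> float:
    # Assumed "true" preference: only real quality counts.
    return r.quality


def judge_score(r: Response) -> float:
    # Assumed imperfect judge: rewards quality but is swayed by persuasion.
    return r.quality + 2.0 * r.persuasion


def candidates(budget: int) -> list[Response]:
    # Every way to split a fixed effort budget between the two.
    return [Response(float(q), float(budget - q)) for q in range(budget + 1)]


def optimize(score, budget: int) -> Response:
    # Stand-in for training: pick whatever the scorer likes most.
    return max(candidates(budget), key=score)


best_for_judge = optimize(judge_score, budget=10)
best_for_gold = optimize(gold_score, budget=10)

print(best_for_judge, judge_score(best_for_judge), gold_score(best_for_judge))
print(best_for_gold, judge_score(best_for_gold), gold_score(best_for_gold))
```

Once the judge's score is the optimization target, the winning response spends its whole budget on persuasion. The measure keeps rising while the thing it was meant to measure does not.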

The implications extend beyond academic interest. Many AI companies rely on automated evaluation systems and public benchmarks to assess progress. If these can be gamed by models specifically optimized for deception, how can we trust any of our metrics?

This is particularly troubling for non-verifiable domains like creative writing, general conversation, or strategic advice, where there’s no objective ground truth to check against. Unlike math or coding problems with clear right answers, these subjective areas are exactly where we most need reliable evaluation methods.

The path forward requires new thinking

The researchers conclude that while reasoning models offer improvements over standard judges, they’re not the silver bullet for alignment many hoped for. The problem isn’t just technical; it’s conceptual. We’re trying to solve a trust problem with more sophisticated technology, but the technology itself becomes part of the problem.

Several directions emerge from this work:

  • First, we need better methods for detecting adversarial alignment, catching when models are optimizing for persuasion rather than helpfulness. 
  • Second, we might need to rethink the entire judge-based training paradigm for non-verifiable tasks. Perhaps the solution isn’t better judges but different training approaches entirely.
  • Third, we need to maintain skepticism about benchmark results. High scores on popular evaluations might indicate genuine capability or sophisticated deception. Without ways to distinguish between the two, we’re flying blind.

As AI systems become more capable, ensuring they remain aligned with human values becomes both more important and more difficult. This research suggests we need to be as innovative in our alignment methods as we are in our model architectures. 

The alternative is a future where our most advanced AI systems are also our most deceptive, optimized not to help us but to convince us they’re helping.

The race to build better AI continues, but this study reminds us that we need to be equally ambitious in developing better ways to ensure these systems actually serve human interests.

Otherwise, we risk creating incredibly sophisticated systems that are experts at one thing above all else: fooling us into thinking they’re on our side.

From engagement to fulfillment: How Agentic AI is rewriting product metrics

What happens to your north star metric when your best users never open your app? Not because they churned – because they delegated. A growing share of interactions with digital products in 2026 aren’t initiated by humans at all.

They’re initiated by AI agents acting on human intent. And most product teams are measuring none of it.

Daily active users. Session length. Engagement rate. These were never neutral measurements – they were built on a specific assumption: that value requires a human, on a screen, spending time.

Agentic AI breaks that assumption at the foundation. And if your product strategy hasn’t accounted for it, you’re optimizing for a world that is quietly disappearing.

From operator to delegator

In traditional digital interactions, users were operators. They navigated interfaces, filtered results, and manually executed every action inside a product.

That model shaped a decade of design thinking – every microinteraction, every onboarding carousel, every engagement loop was built for a human with a thumb on a screen.

That model is changing. Users are increasingly becoming delegators, expressing intent in natural language and expecting autonomous agents to fulfill it on their behalf.

Take a frequent traveler today. 

Instead of opening multiple airline apps, comparing tabs, and entering payment details, they tell their agent:

“Find me a direct flight to Tokyo for early April, business class, under $2,000.”

The agent queries systems, checks constraints, and returns a confirmed booking. The traveler never touched an interface.
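To make the delegation concrete, here is a minimal sketch of how that spoken request might look once an agent turns it into structured constraints. Everything in it is an assumption for illustration: the `FlightIntent` and `FlightOffer` types, the reading of "early April" as April 1 to 10, 2026, and the sample offers. No real airline API is involved.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FlightIntent:
    destination: str
    earliest: date
    latest: date
    cabin: str
    price_below_usd: float
    direct_only: bool = True


@dataclass(frozen=True)
class FlightOffer:
    destination: str
    departs: date
    cabin: str
    price_usd: float
    stops: int


def satisfies(offer: FlightOffer, intent: FlightIntent) -> bool:
    # Every stated constraint must hold; "under $2,000" is a strict bound.
    return (
        offer.destination == intent.destination
        and intent.earliest <= offer.departs <= intent.latest
        and offer.cabin == intent.cabin
        and offer.price_usd < intent.price_below_usd
        and (offer.stops == 0 or not intent.direct_only)
    )


def pick_offer(offers: list[FlightOffer], intent: FlightIntent) -> Optional[FlightOffer]:
    # Cheapest offer that meets the intent, or None if nothing qualifies.
    matching = [o for o in offers if satisfies(o, intent)]
    return min(matching, key=lambda o: o.price_usd, default=None)


# Hypothetical interpretation of the traveler's request.
INTENT = FlightIntent("Tokyo", date(2026, 4, 1), date(2026, 4, 10), "business", 2000.0)

# Made-up offers standing in for whatever the agent's queries return.
OFFERS = [
    FlightOffer("Tokyo", date(2026, 4, 3), "business", 1850.0, 0),
    FlightOffer("Tokyo", date(2026, 4, 2), "business", 1600.0, 1),  # not direct
    FlightOffer("Tokyo", date(2026, 4, 5), "business", 2100.0, 0),  # over budget
    FlightOffer("Tokyo", date(2026, 4, 4), "economy", 900.0, 0),    # wrong cabin
]

print(pick_offer(OFFERS, INTENT))
```

The point of the sketch is that nothing in it needs a screen: the intent is data, the check is a function, and a product that exposes offers in a structured form can be served this way.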

In that interaction, the entity your product served wasn’t a human – it was software acting on a human’s behalf. If your product isn’t designed for that reality, it didn’t just underperform. It was invisible.

Why engagement metrics are losing their meaning

Traditional UX was built around human cognitive strengths: visual hierarchy, progressive disclosure, interfaces that reward exploration. These remain valuable. But they create serious obstacles for autonomous systems that need structured, unambiguous, machine-reproducible actions.

An agent can’t appreciate a beautiful interface. It needs an endpoint.

When a product delivers value solely through visual interaction, an agent must resort to screen scraping and emulation – brittle workarounds that are unscalable and prone to failure. 

Products built only for human eyes are becoming functionally invisible to the agents increasingly acting on human behalf. This is the machine-readability problem, and most product roadmaps don’t mention it once.

It’s also important to be precise about where this applies. For entertainment, social, and discovery-driven products, engagement remains the right measure: a user lingering on Spotify isn’t failing to fulfill an intent; lingering is the intent.

But for task-completion products – travel, finance, logistics, professional SaaS, healthcare, e-commerce – time spent is friction, not value.

The handoff nobody has designed

Even within task-completion products, there are two distinct modes that most teams have never explicitly separated. In discovery mode, the user is browsing, comparing, and exploring; intent hasn’t yet crystallized, and engagement here is intentional and valuable.

In execution mode, intent has crystallized, the user knows what they want, and every additional step is friction. Agentic AI doesn’t eliminate discovery; it collapses execution.

The human stays in the loop for inspiration and preference-setting; the agent takes over the moment intent crystallizes.

That boundary, the handoff from discovery to execution, is the most important design decision in your product right now. Most teams haven’t drawn it deliberately, which means someone else will redesign it for them.

A new metric: Return on intent

If session length was the north star of the attention economy, the age of autonomous agents demands a different compass: Return on Intent (RoInt).

RoInt asks a deceptively simple question: when a user, or an agent acting on their behalf, initiated an intent, did your product deliver the right outcome, within the right constraints, without requiring human correction?

Unlike task completion rate, which only asks “did it finish?”, RoInt asks “did it finish correctly, for the right reasons?”
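The article gives no formula for RoInt, so the following is only one possible reading, sketched under assumptions: RoInt is taken to be the share of intents that completed with the right outcome, within constraints, and without human correction. The `IntentRecord` fields and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentRecord:
    completed: bool        # did the flow finish at all?
    outcome_correct: bool  # was the result what the user wanted?
    constraints_met: bool  # budget, dates, policy limits, and so on
    human_corrected: bool  # did a person have to step in and fix it?


def task_completion_rate(records: list[IntentRecord]) -> float:
    if not records:
        return 0.0
    return sum(r.completed for r in records) / len(records)


def return_on_intent(records: list[IntentRecord]) -> float:
    # Assumed definition: all three RoInt conditions must hold at once.
    if not records:
        return 0.0
    fulfilled = sum(
        r.completed and r.outcome_correct and r.constraints_met and not r.human_corrected
        for r in records
    )
    return fulfilled / len(records)


# Made-up log of four intents.
RECORDS = [
    IntentRecord(True, True, True, False),     # fulfilled cleanly
    IntentRecord(True, True, False, False),    # finished, but broke a constraint
    IntentRecord(True, False, True, True),     # finished, needed human correction
    IntentRecord(False, False, False, False),  # never finished
]

print(task_completion_rate(RECORDS), return_on_intent(RECORDS))
```

On this sample, three of four intents "finished" but only one was fulfilled in the RoInt sense, which is the gap between the two questions.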

While 88% of organizations have implemented AI to some degree, only 23% have scaled agent systems into core business functions, according to McKinsey. That gap – between experimentation and execution – is precisely what RoInt is designed to close.

It gives product teams a metric that reflects what agents actually do, not what dashboards were built to measure. McKinsey estimates that by 2030, agent-based systems could generate $450–$650 billion in annual revenue in mature industries. 

But the failure rate is real:

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs and unclear business value. The difference between the companies that capture that revenue and those that abandon their initiatives will not be AI adoption.

It will all be down to whether they redesigned their products and their metrics around intent.


Is it already happening?

Some companies are already operating by this logic, even if they haven’t named it.

Stripe’s Agentic Commerce Protocol, launched in late 2025, is a direct embodiment of the principle: rather than forcing AI agents to navigate a visual checkout interface, Stripe built an open standard that lets agents transact programmatically, treating fulfillment as the product, not the UI surrounding it.

Klarna offers the outcome data. After redesigning its customer service around agent-driven fulfillment, Klarna reported that resolution times dropped from eleven minutes to two, repeat inquiries fell by 25%, and customer satisfaction held steady. 

Their journey also illustrates the limits: agent fulfillment works for well-defined intents, but breaks down when context is ambiguous or stakes are high.

The discovery layer and the human override still matter. Neither company measures success by how long users spend in their product. They measure whether the intent was fulfilled, and their RoInt is the number that tells them if it was.

By 2028, Gartner predicts that 33% of enterprise software applications will include agentic AI – up from under 1% in 2024 – with agents making 15% of daily work decisions autonomously. The companies building for that reality are making product decisions today that their competitors aren’t.

So, what changes?

  • Ask a different question in your next sprint review. Stop asking “how many users are engaged with this feature?” Start asking “how many intents did this feature successfully fulfill, and how many required a human to intervene?” You don’t need new infrastructure to start. You need a new question on the whiteboard.
  • Draw the handoff line deliberately. Map your product’s discovery layer and execution layer explicitly. Where does browsing end and intent begin? That boundary is where agentic AI will enter your product first. If you haven’t designed it, you haven’t designed for what’s coming.
  • Treat your API as a product surface, not plumbing. If a well-instructed AI agent tried to complete your product’s core task today, without screen scraping, without emulation, and without a human in the loop, could it? If the answer is no, your product is invisible to the agents increasingly acting on users’ behalf. Stripe didn’t build its Agentic Commerce Protocol as a developer convenience. It built it because the interface layer is becoming optional.
  • Make RoInt visible in your analytics. Add one new dimension to your dashboard: agent-initiated interactions. Track completion rate, constraint adherence, and intervention frequency separately from human sessions. A product with rising RoInt and falling session time isn’t losing users; it’s serving them better. That distinction matters enormously when presenting to stakeholders still anchored to engagement benchmarks.
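The last item in that list can start as a simple grouping step. The sketch below assumes interaction events already carry an `initiator` field plus three boolean flags; those field names and the sample events are hypothetical, not taken from any particular analytics tool.

```python
from collections import defaultdict

# Made-up interaction log; in practice this would come from your event store.
EVENTS = [
    {"initiator": "agent", "completed": True, "constraints_met": True, "intervened": False},
    {"initiator": "agent", "completed": True, "constraints_met": True, "intervened": False},
    {"initiator": "agent", "completed": True, "constraints_met": False, "intervened": True},
    {"initiator": "agent", "completed": False, "constraints_met": False, "intervened": True},
    {"initiator": "human", "completed": True, "constraints_met": True, "intervened": False},
    {"initiator": "human", "completed": True, "constraints_met": True, "intervened": False},
]


def summarize(events: list[dict]) -> dict[str, dict[str, float]]:
    # Group by who initiated the interaction, then compute each rate per group.
    groups: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        groups[event["initiator"]].append(event)

    summary = {}
    for initiator, rows in groups.items():
        n = len(rows)
        summary[initiator] = {
            "interactions": n,
            "completion_rate": sum(r["completed"] for r in rows) / n,
            "constraint_adherence": sum(r["constraints_met"] for r in rows) / n,
            "intervention_rate": sum(r["intervened"] for r in rows) / n,
        }
    return summary


for initiator, stats in summarize(EVENTS).items():
    print(initiator, stats)
```

Keeping agent and human rows apart is the design choice that matters here: a blended average would hide exactly the agent-side failures the dashboard is meant to surface.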

The question worth sitting with

Here is the uncomfortable question every product leader should sit with: if an AI agent could fulfill your product’s core value proposition without a single human ever opening your app, would that be a failure, or the highest possible expression of what you set out to build?

The instinct is to say failure. No sessions means no data, no upsell surface, no engagement loop. But that instinct is the attention economy talking, and it’s increasingly out of step with what users actually want.

The products that will define the next decade won’t be the ones users love to use. They’ll be the ones users trust to act. That’s a fundamentally different design brief and a fundamentally different metric, and most product teams haven’t written either one yet.


Rohan Mitra, Product Manager at PhonePe | Building in the SaaS Space.

NVIDIA GTC 2026: The AI stack gets real

At NVIDIA’s GTC 2026, CEO Jensen Huang laid out a sweeping vision for AI’s next era. From chips and agent frameworks to robotics and real-time graphics, Huang’s keynote made one thing clear: The future of AI will be built on infrastructure, and NVIDIA intends to own it.

Describing the company as “The first vertically integrated but horizontally open company,” Huang positioned NVIDIA as the foundation layer for all AI workloads, while encouraging developers, enterprises, and partners to innovate openly on top. 

For AI professionals, this signals a shift from focusing solely on models to thinking about the systems and platforms that underpin them.


Securing and scaling agentic AI

One of the keynote’s central themes was agentic AI. NVIDIA introduced NemoClaw, an open-source framework that embeds governance, safety, and privacy directly into autonomous agents. Enterprises can now deploy agents that are auditable, controllable, and compliant with internal privacy requirements.

Complementing NemoClaw, the Agent Toolkit simplifies building and deploying secure agents, helping organizations accelerate AI adoption without starting from scratch. Meanwhile, the Vera Rubin platform (powered by seven new chips) optimizes large-scale training and persistent agent workloads. 

Huang even teased space-based data centers, hinting at long-term strategies to overcome terrestrial compute and energy limits.

Key enterprise benefits include:

  • Built-in safety and privacy controls for autonomous agents
  • Simplified deployment and integration into existing enterprise systems

Together, these announcements signal NVIDIA’s intent to provide a secure, scalable foundation for agentic AI across industries.


DLSS 5: Real-time AI-enhanced graphics

On the consumer side, NVIDIA unveiled DLSS 5, a real-time AI rendering system that generates photorealistic lighting and materials. Major studios such as Bethesda, Capcom, and Ubisoft are early adopters. While DLSS 5 is designed for gaming, its impact extends far beyond entertainment. 

Photorealistic rendering enables richer simulation environments, digital twins, and synthetic data, all critical for training AI agents and robotics systems.

By connecting graphics, simulation, and enterprise AI, NVIDIA demonstrates that improvements in one domain can accelerate innovation across the entire ecosystem.

Expanding the AI ecosystem

Beyond agents and graphics, NVIDIA showcased platforms for robotics, autonomous vehicles, and industrial AI applications. The company’s approach is to unify these verticals under a single stack, providing scalable infrastructure and consistent development tooling. 

This ensures AI agents, robots, and autonomous systems operate efficiently across industries.

Strategic ecosystem advantages:

  • Unified infrastructure for AI agents, robotics, and simulations
  • Standardized tooling that reduces deployment friction
  • Scalable systems to support complex AI workloads

This ecosystem positioning reinforces NVIDIA’s role as the foundation for both enterprise AI and research projects.

6 impacts this will have on AI professionals

The announcements at NVIDIA GTC 2026 reshape what it means to work in AI. Here are six key impacts professionals should be preparing for:

1. A shift from model building to system design

AI professionals will need to think beyond models and focus on end-to-end systems. With platforms like NemoClaw and the Agent Toolkit simplifying development, the real challenge becomes integrating models into scalable, production-ready environments.

2. Infrastructure knowledge becomes essential

Understanding compute is no longer optional. Platforms such as Vera Rubin highlight how performance, cost, and scalability are tied to infrastructure decisions. AI professionals will need a working knowledge of hardware, distributed systems, and optimization.

3. Governance and safety move to the core

As agentic AI becomes mainstream, governance is built into the stack—not added later. Tools like NemoClaw make compliance and auditability central, requiring professionals to design systems that are transparent, controllable, and aligned with regulations.

4. Persistent AI systems become the norm

AI is shifting from one-off deployments to continuous, autonomous systems. Professionals will increasingly manage long-running agents that require monitoring, updates, and lifecycle management—more like operating software infrastructure than delivering static models.

5. Simulation and synthetic data go mainstream

With advances like DLSS 5, simulation has become a standard part of AI development. Professionals will need to work with synthetic data, digital twins, and virtual environments to train and validate systems before real-world deployment.

6. Ecosystem strategy becomes a career skill

As NVIDIA builds a vertically integrated stack, professionals must navigate the trade-offs between leveraging powerful platforms and avoiding vendor lock-in. Choosing the right tools (and maintaining flexibility) becomes a strategic decision.


Closing thought

The takeaway is clear: AI professionals are evolving into system architects, operators, and strategists. The future belongs to those who can not only build intelligent models, but also deploy and manage them effectively within complex, real-world environments.

How AI in life sciences is reshaping healthcare

The life sciences landscape is at a defining crossroads. On one hand, the promise of scientific breakthroughs in genomics, biologics, and diagnostics is more palpable than ever.

On the other, the path to bringing these innovations to market is fraught with escalating costs, complex regulatory hoops, and the absolute imperative of patient safety.

As a product manager operating in this dynamic sphere, I see a tremendous opportunity – and a profound responsibility. The opportunity lies in leveraging Artificial Intelligence (AI) to fundamentally reshape how we develop, deliver, and monitor life-saving therapies.

The responsibility is to do so in a way that is compliant, ethical, and unwaveringly centered on the most critical stakeholder: the patient.

Let’s be clear: AI isn’t here to replace the rigorous science or the compassionate human touch that defines healthcare. Its true power lies in its ability to amplify human intelligence, automate mundane tasks, and extract meaningful patterns from vast, siloed datasets.

In doing so, we can solve some of the most persistent, core problems in life sciences.

Tackling the core problems in life sciences with AI

For years, as a product manager in life sciences, I have grappled with a consistent set of challenges. These are the problems that bottleneck innovation and increase the risk of product failure:

Slow and costly drug discovery: The traditional “one size fits all” approach to drug development is incredibly slow, costly, and has a high failure rate. Identifying a promising lead compound can take years of painstaking lab work.

Patient recruitment for clinical trials: One of the primary reasons for clinical trial delays is the difficulty in identifying and enrolling eligible patients. This directly translates to increased costs and time-to-market.

Complex, ever-changing regulations: Navigating the complex landscape of FDA, EMA, and other regulatory bodies is a monumental task. Ensuring global compliance is not just a burden; it’s a prerequisite for market access.

Suboptimal patient engagement: Even with a miracle drug, poor patient adherence can significantly diminish its real-world efficacy. Understanding the patient journey and keeping them engaged is a persistent challenge.

Inefficient supply chain management: From managing delicate biologics to tracking post-market surveillance data, the life sciences supply chain is incredibly complex. A single misstep can have catastrophic consequences for patient safety.

By deploying AI effectively, we can begin to address these core problems, measured against critical Key Performance Indicators (KPIs):

  • Time-to-market: The speed at which we move from drug discovery to market approval.
  • Trial recruitment rate: The speed and accuracy of identifying and enrolling suitable patients for clinical trials.
  • Compliance error rate: The number of identified compliance gaps or audit findings.
  • Patient adherence and engagement: Measurable improvements in how patients interact with their treatments and care teams.
  • Patient outcome: Most importantly, the real-world health outcomes for the patients we serve.


AI agents: Our partners in progress

So, how do we translate this potential into reality? The key is to think of AI not as a black box, but as a system of intelligent agents, each with a specific purpose, working in concert with human experts.

Let’s explore some tangible examples in life sciences:

1. The discovery & development agent

The vision: Shift from a purely linear R&D process to a data-driven, accelerated discovery model.

The application: AI algorithms can analyze millions of biomedical publications, patent databases, and real-world evidence to predict promising molecule interactions, simulate clinical trial outcomes, and identify potential off-target effects. This is about making smarter choices early on.

Example: Companies are using AI to model the structure of proteins and predict how small molecules might bind to them, drastically accelerating the early stages of drug discovery. This reduces the number of initial physical compound tests required from millions to a highly targeted subset.

Patient-centric view: By accelerating discovery and improving the likelihood of a drug’s success, we bring life-saving therapies to patients faster. This agent also helps us design trials that are more likely to deliver meaningful results for specific patient subpopulations, moving us closer to the promise of personalized medicine.


2. The patient & trial matching agent

The vision: Streamline clinical trials by quickly and accurately identifying and connecting with the right patients.

The application: This agent can analyze Electronic Health Records (EHRs), lab results, and genomic data to identify patients who meet the strict eligibility criteria for a clinical trial. Natural Language Processing (NLP) is used to read through complex clinical notes that are often hard for traditional search methods to parse.

Example: A major pharmaceutical company deployed an AI solution to screen EHR data across a network of hospitals. The system identified thousands of potential candidates for a complex oncology trial in a fraction of the time it would have taken a human team, significantly cutting trial recruitment timelines.

Patient-centric view: This directly addresses one of the biggest bottlenecks in bringing new therapies to market. For a patient waiting for a new treatment option, this could mean the difference between getting access to a trial and missing an opportunity.

The key is to design these agents to work ethically, with full patient consent and data privacy at the core.
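The structured half of that matching step can be sketched in a few lines. This is an illustration under assumptions, not the deployed system described above: the `Patient` and `Criteria` types, the lab field, and every threshold are invented, the records are synthetic, and the NLP pass over clinical notes is left out entirely. Consent is treated as a hard gate before any clinical criterion is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: str
    age: int
    diagnosis: str
    egfr_ml_min: float  # example lab value (kidney function)
    consented: bool     # agreed to be screened for research


@dataclass(frozen=True)
class Criteria:
    diagnosis: str
    min_age: int
    max_age: int
    min_egfr_ml_min: float


def eligible(p: Patient, c: Criteria) -> bool:
    # Consent comes first: unconsented records are never matched.
    return (
        p.consented
        and p.diagnosis == c.diagnosis
        and c.min_age <= p.age <= c.max_age
        and p.egfr_ml_min >= c.min_egfr_ml_min
    )


def screen(patients: list[Patient], c: Criteria) -> list[str]:
    return [p.patient_id for p in patients if eligible(p, c)]


# Hypothetical trial criteria and synthetic patients.
TRIAL = Criteria(diagnosis="nsclc", min_age=18, max_age=75, min_egfr_ml_min=60.0)
PATIENTS = [
    Patient("P-001", 54, "nsclc", 75.0, True),
    Patient("P-002", 61, "nsclc", 45.0, True),    # lab value too low
    Patient("P-003", 47, "nsclc", 88.0, False),   # no consent
    Patient("P-004", 39, "breast", 90.0, True),   # different diagnosis
    Patient("P-005", 79, "nsclc", 80.0, True),    # outside age range
]

print(screen(PATIENTS, TRIAL))
```

A candidate list like this is a starting point for human review by the trial team, not an enrollment decision.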


3. The regulatory compliance & pharmacovigilance agent

The vision: Proactively monitor for adverse events and ensure continuous compliance across the product lifecycle.

The application: This agent uses NLP and machine learning to sift through social media posts, medical forums, patient support group data, and traditional medical literature to identify potential safety signals (adverse events) that might not be captured in formal reporting systems.

It can also be used to automatically scan new regulatory guidelines and update internal compliance protocols, reducing the risk of human error.

Example: By analyzing natural language in patient forums, an AI model flagged a pattern of severe fatigue associated with a new treatment that hadn’t been prominent in clinical trials.

This early warning allowed the product team to proactively update safety labels and investigate further, prioritizing patient safety.

Patient-centric view: Compliance isn’t just about avoiding fines; it’s about protecting patients.

By automating the “brute force” work of pharmacovigilance and compliance mapping, this agent helps ensure that the real-world performance of a drug is continuously monitored, allowing for rapid intervention if safety issues are detected.

This builds trust with patients and regulators alike.
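One standard way to quantify a signal like the fatigue example is the proportional reporting ratio (PRR): the share of reports mentioning the drug that also mention the event, divided by the same share among all other reports. The sketch below is a simplification under assumptions: plain keyword matching stands in for the NLP the agent would use, "drugx" is a placeholder name, and the posts are invented.

```python
from typing import Optional


def proportional_reporting_ratio(posts: list[str], drug: str, symptom: str) -> Optional[float]:
    # 2x2 table: a = drug and symptom, b = drug only,
    #            c = symptom only,     d = neither.
    a = b = c = d = 0
    for text in posts:
        lowered = text.lower()
        has_drug = drug in lowered
        has_symptom = symptom in lowered
        if has_drug and has_symptom:
            a += 1
        elif has_drug:
            b += 1
        elif has_symptom:
            c += 1
        else:
            d += 1
    # Undefined when there are no drug posts or no comparison cases.
    if a + b == 0 or c == 0:
        return None
    return (a / (a + b)) / (c / (c + d))


# Invented forum posts for illustration only.
POSTS = [
    "Started DrugX last week, the fatigue is severe",
    "DrugX again today and exhausted, severe fatigue",
    "DrugX is working well for me",
    "Fatigue after a long shift, nothing new",
    "My knee hurts since the run",
    "Anyone tried yoga for back pain?",
    "Sleeping fine on my old medication",
    "Mild headache this morning",
]

print(proportional_reporting_ratio(POSTS, "drugx", "fatigue"))
```

A ratio well above 1 suggests the symptom is mentioned disproportionately often alongside the drug. With counts this small it only indicates where a human safety reviewer should look next.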

The road ahead: Co-creation, not replacement

The path forward for life sciences is not about a grand “takeover” by AI. It’s about a collaborative future where AI enables and empowers. As product managers, we are the architects of this future.

Our role is to identify the core problems, define the relevant KPIs, and champion the deployment of AI agents that are not only powerful but also patient-centric by design.

The future of healthcare is intelligent, and it’s built on a foundation of data, collaboration, and an unwavering commitment to the people we serve. Let’s embrace AI, not as a shortcut, but as a critical tool that helps us deliver on the promise of better health for all.


About the author:

Shivakumaran Venkataraman is an experienced Product Manager with a focus on delivering innovative, data-driven solutions in the life sciences space. He is passionate about leveraging technology to improve patient outcomes while navigating the complexities of the healthcare landscape. Find Shivakumaran exploring the intersection of AI, real-world data, and patient-centric product strategy.

Top 20 tech leaders in New York

From finance and healthcare to government and academia, a growing cadre of Chief AI Officers (CAIOs) and CISOs is shaping strategy, driving adoption, and defining the future of AI in the city.

For anyone interested in meeting these trailblazers and hearing firsthand how they apply AI in real-world scenarios, AIAI New York on June 04, 2026 is an opportunity to connect with them and explore the latest innovations in AI.


Top 5 highlighted applied AI leaders in New York

1. Denis Yarats – Co‑Founder & CTO, Perplexity AI

Denis Yarats is an NYU‑trained computer scientist and one of the driving forces behind Perplexity AI, a rapidly growing intelligent search and generative AI platform.

His research in reinforcement learning and scalable deep learning helped establish his reputation in academic and applied AI communities before he transitioned into AI product leadership.

Yarats’s work bridges cutting‑edge research and real user‑facing systems, showcasing how foundational AI innovation can move from theory to practice.


2. Rob Fergus – NYU AI Research Pioneer

Rob Fergus is a professor at NYU’s Courant Institute of Mathematical Sciences and a well‑known figure in the deep learning research world. His contributions to computer vision, convolutional neural networks, and machine learning theory have been widely cited and form part of the foundation of modern AI systems used across industries.

Fergus’s academic leadership has helped elevate New York’s role in AI research.


3. Meredith Whittaker – AI Power & Privacy Advocate

Meredith Whittaker is the president of Signal and a leading voice on the risks and power dynamics of AI. Based in New York, she has been at the forefront of conversations around surveillance, data privacy, and the societal impact of large-scale AI systems.

Her work challenges how AI is built and deployed—making her one of the most influential critics shaping responsible AI today.


4. Dan Huttenlocher – AI Systems & Applied Research Leader

Dan Huttenlocher is the dean of computing at MIT and has deep ties to New York’s tech ecosystem through his leadership at Cornell Tech. His work sits at the intersection of AI systems, academia, and real-world application, helping bridge cutting-edge research with practical deployment across industries.


5. Andrew Kimball – NYC Tech Ecosystem & AI Economy Strategist

As President & CEO of the New York City Economic Development Corporation (NYCEDC), Andrew Kimball plays a strategic role in shaping the city’s AI ecosystem, economic policy, and growth initiatives. His work emphasizes AI‑driven industry expansion, infrastructure investment, and talent development, contributing to New York’s standing as a vital hub for innovation.


Other notable AI leaders in New York (6–20)

These leaders hold senior AI or technology roles, and several are confirmed speakers at the AIAI New York Summit (June 04, 2026), where you can hear from them live.

6. Michael Domanic – VP, AI, UserTesting

Leads AI strategy at UserTesting, applying machine learning to enhance user experience and research insights. Featured speaker at CAIO New York.

7. Ash Dhupar – Chief Data & Analytics Officer, Analog Devices

Oversees data and AI programs, ensuring analytics strategy drives engineering and operational value. Featured speaker at CISO New York.

8. Ravi Sarkar – Enterprise CTO, Technology Strategy, Microsoft

Guides enterprise AI strategy for large clients, including adoption of scalable AI and cloud‑native systems.

9. Frank Indiviglio – Chief Technology Officer, NOAA

Directs AI and data science initiatives for environmental forecasting and modeling systems. Featured speaker at CISO Summit.

10. Daniel Gremmell – Chief Data Officer, Zinnia

Leads AI and analytics efforts to transform enterprise data into strategic insights. Featured speaker at CAIO Summit.

11. Girish Gajwani – VP, Architect, Securitized Products Technology, Barclays

Drives AI infrastructure for financial products and models in Barclays’ New York office. Featured speaker at CISO Summit.

12. Mark Ritzmann – Chief Information Officer, Columbia University

Oversees technology and research computing that supports AI work across the university.

13. Vijay Yadav – CTO & Founding Engineer, Brooklyn Health

Leads technical strategy for AI‑enabled healthcare solutions focused on community impact.

14. Hilary Mason – Applied AI & Data Science Leader

Hilary Mason is a leading figure in applied AI and data science, and co-founder of Fast Forward Labs (acquired by Cloudera). Based in New York, she has spent her career helping enterprises understand and adopt emerging AI technologies. 

15. Kuntal Dutta – Global Head of Information Security Data, Analytics & Insights, BNY Mellon

Leads AI‑driven analytics for cybersecurity and risk intelligence at BNY Mellon.

16. Dana Kilcrease – Chief Information Security Officer, Berkeley College

Guides AI governance and data protection strategies in educational technology.

17. Srivatsan Raghavan – Chief Information & Technology Officer, OHLA USA

Oversees AI and analytics integration for large infrastructure and operational systems.

18. Demis Hassabis – AGI Vision & Frontier AI Leader

Demis Hassabis is the CEO of DeepMind and one of the most influential figures in modern AI. While based between London and the U.S., his work has global impact, including strong ties to New York’s AI and enterprise ecosystem. 

19. Thomas Wolf – Open-Source AI & Developer Ecosystem Leader

Thomas Wolf is the co-founder of Hugging Face, one of the most important platforms in the AI ecosystem. With a major presence in New York, Hugging Face has become the backbone of open-source AI development.

20. Davood Shamsi – Executive Director, AI/ML, JP Morgan Chase

Leads machine learning applications for predictive modeling and financial decision support. Featured speaker at CAIO Summit.


Why these leaders matter

New York’s AI leaders are influencing public policy, advancing healthcare innovation, and driving the next generation of AI research.

From integrating machine learning into complex financial systems to applying predictive analytics for city governance, these executives represent the cutting edge of AI leadership in the city. 

Many of them also share their expertise at conferences and industry events, providing invaluable insights into how AI is transforming organizations and society.

Join some of these amazing leaders live at AIAI New York, June 2026

To meet these trailblazers in person and gain exclusive insights into the future of AI, join us at CAIO and CISO Summit New York on June 04, 2026.

This is your chance to connect directly with top AI leaders, hear firsthand about their work, and explore the latest innovations shaping industries across New York.

Join a curated room of 300+ applied AI and security leaders who are actually shipping AI at scale.

This invite-only summit is exclusive to technology leaders, ensuring quality discussion and insight into the exact challenges you face in production.

AI swarms are coming: Here’s why it matters

For the past two years, the dominant mental model of AI has been simple: one powerful model, one prompt, one response. Think copilots, chatbots, and assistants: polished, helpful, and fundamentally solo performers.

That model is now evolving.

A new paradigm is emerging, one where AI systems collaborate. These systems operate as hundreds or even thousands of coordinated AI agents working together. 

Welcome to the age of agentic AI and multi-agent systems.

From lone models to multi-agent systems

The shift from single models to multi-agent AI systems represents an architectural evolution.

Instead of assigning planning, reasoning, execution, and verification to a single model, these responsibilities are distributed across specialized agents.

  • A planner agent maps the task and defines strategy
  • Research agents gather and filter relevant information
  • Executor agents carry out actions and interact with tools
  • Critic agents review outputs and improve quality

Individually, each agent focuses on a narrow capability. Together, they form a distributed AI system with greater flexibility, adaptability, and depth. The result resembles a coordinated team rather than a single intelligence.
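The division of labor above can be sketched in a few lines of Python. This is a toy illustration rather than a real framework: the `TaskState` class, the four agent functions, and their deliberately trivial logic are all hypothetical stand-ins for what would normally be model calls and tool invocations.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Shared context handed from one agent to the next (hypothetical)."""
    goal: str
    plan: list[str] = field(default_factory=list)
    notes: dict[str, str] = field(default_factory=dict)
    results: list[str] = field(default_factory=list)
    approved: bool = False


def planner(state: TaskState) -> TaskState:
    # Toy planner: split the goal into steps on the word "and".
    state.plan = [step.strip() for step in state.goal.split(" and ")]
    return state


def researcher(state: TaskState) -> TaskState:
    # Toy research agent: attach a placeholder note to each planned step.
    for step in state.plan:
        state.notes[step] = f"background for '{step}'"
    return state


def executor(state: TaskState) -> TaskState:
    # Toy executor: "perform" each step using the note gathered for it.
    state.results = [f"done: {step} (using {state.notes[step]})" for step in state.plan]
    return state


def critic(state: TaskState) -> TaskState:
    # Toy critic: approve only if every planned step produced a result.
    state.approved = len(state.plan) > 0 and len(state.results) == len(state.plan)
    return state


def run_pipeline(goal: str) -> TaskState:
    """Pass one shared state through the specialized agents in order."""
    state = TaskState(goal=goal)
    for agent in (planner, researcher, executor, critic):
        state = agent(state)
    return state


state = run_pipeline("collect sales data and draft a summary")
print(state.plan)
print(state.approved)
```

Each function touches only its own slice of the shared state, which is the point of the pattern: a narrow responsibility per agent, with coordination handled by whatever passes the state along.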


Why are AI swarms gaining momentum now?

Multi-agent systems have existed for years, yet several recent advances have accelerated their adoption.

Large language models now handle autonomous sub-tasks with greater reliability, while modern AI orchestration frameworks make it easier to coordinate multiple agents within a single workflow. 

At the same time, scalable cloud infrastructure enables parallel execution at a level that supports hundreds or thousands of agents operating simultaneously.

These developments have created a new class of systems designed for parallelism, coordination, and scalable AI automation, opening the door to more complex and dynamic use cases.


What AI swarms enable for complex problem solving

AI swarms perform especially well in environments that require multi-step reasoning, open-ended exploration, and parallel processing.

  • Problems can be decomposed into smaller parallel tasks
  • Multiple solution paths can be explored simultaneously
  • Outputs can be compared, refined, and improved iteratively

In practice, this supports use cases such as automated research workflows, large-scale simulations, and adaptive decision-making systems. Rather than relying on a single path, the system evaluates multiple possibilities and converges on higher-quality results over time.
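As a rough sketch of "explore several paths, then compare", the example below runs three simple heuristics in parallel on a small packing problem and keeps the best result. The problem, the `STRATEGIES` table, and the scoring rule are assumptions chosen for illustration; in a real swarm each strategy would be an agent and the score would come from an evaluator.

```python
from concurrent.futures import ThreadPoolExecutor


def pack(items: list[int], capacity: int) -> list[int]:
    """Greedily add items in the given order while they still fit."""
    chosen, total = [], 0
    for item in items:
        if total + item <= capacity:
            chosen.append(item)
            total += item
    return chosen


# Hypothetical "agents": each explores a different ordering of the same items.
STRATEGIES = {
    "as_given": lambda items: list(items),
    "smallest_first": lambda items: sorted(items),
    "largest_first": lambda items: sorted(items, reverse=True),
}


def explore(items: list[int], capacity: int) -> tuple[str, list[int]]:
    """Run every strategy in parallel and keep the fullest packing."""
    def run(name: str) -> tuple[str, list[int]]:
        return name, pack(STRATEGIES[name](items), capacity)

    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(run, STRATEGIES))

    # Compare step: score each candidate by how much capacity it fills.
    return max(candidates, key=lambda candidate: sum(candidate[1]))


best_name, best_choice = explore([4, 8, 1, 4, 2, 1], capacity=10)
print(best_name, best_choice, sum(best_choice))
```

The compare step is where quality comes from. Without a meaningful score, running more strategies in parallel only produces more candidates, not better ones.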


So, what does this mean for AI professionals?

The shift toward agentic AI systems introduces a new set of expectations for AI professionals.

Building effective multi-agent systems now involves orchestration, where developers design how agents communicate, collaborate, and share context without stepping on each other’s toes. State management becomes critical, since each agent operates with its own memory, assumptions, and occasional moments of confusion. 

Engineers also need to design resilient systems that handle errors gracefully while keeping performance stable across distributed components.

Observability plays a central role as well. Debugging a multi-agent system often feels less like fixing code and more like mediating a disagreement between highly confident coworkers.

You trace interactions, identify where things drifted off course, and refine coordination strategies so the system behaves more like a team and less like a group chat gone wrong.

As a result, the role of the AI engineer is expanding toward AI systems design, AgentOps, and distributed AI architecture, with a stronger emphasis on building scalable, cooperative ecosystems that actually deliver outcomes.


The current challenges of agentic AI

AI swarms introduce a new layer of complexity that comes with trade-offs.

Coordination overhead increases as more agents are added, and compute costs rise with large-scale parallel execution. In addition, emergent behavior within multi-agent systems can produce unexpected or inconsistent outcomes, especially when agents interact in unanticipated ways.

In some cases, systems generate many similar outputs without meaningful improvement in accuracy, highlighting the importance of strong evaluation frameworks. Ensuring reliability requires careful design and well-defined feedback loops.


The future of autonomous AI systems

The trajectory of agentic AI points toward increasingly autonomous and persistent systems.

Future architectures are likely to include agents that operate continuously, adapt based on feedback, and retain memory across tasks. These systems will integrate into broader ecosystems where agents interact with tools, services, and other agents to complete complex workflows.

This evolution supports the development of end-to-end AI automation, where coordinated systems handle planning, execution, and optimization with minimal human intervention.


Final thoughts

The most important shift involves organization.

AI is evolving into coordinated, multi-agent intelligence, where systems are designed around collaboration rather than isolation.

As coordination and communication become central to AI development, complexity increases alongside capability. The result is a new generation of systems built to operate at scale, solve complex problems, and deliver outcomes through cooperation.

The future of AI centers on networks of intelligent agents working together to achieve shared goals.

Fighting financial crime with hybrid AI

I’ve been in the data game long enough to see plenty of AI projects crash and burn. 

I started my career building data warehouses for telcos and banks, then moved into machine learning consulting, where I led hundreds of projects across industries. Now I’m leading data analytics and machine learning at Phenom, and I want to share something we recently built that actually works.

Let me be clear about what I mean when I say “Gen AI” here. I’m talking about LLMs and the tools built on top of them. The “old school ML” I’ll reference means those low-complexity supervised models we’ve been using for years, the ones that are fast, cheap, and reliable by nature.

The reality of building AI in fintech

Phenom provides banking solutions for SMEs across Europe, but at our core, we’re a B2B fintech scale-up. Each of these words carries weight.

Being B2B means every single client counts. We can’t mess around with client communications or operations. Everything that touches our clients needs to meet a certain standard, no exceptions.

Being a fintech means we love technology, sure, but we’re also bound by regulations. The Financial Crimes Enforcement Network doesn’t care how innovative your solution is if it doesn’t meet compliance standards.

And being a scale-up? That means we can’t afford AI theater. We have some budget for innovation and experimentation, but every investment needs to demonstrate real efficiency gains and positive ROI.

These constraints shaped our entire approach to AI and machine learning at Phenom. We’ve established two fundamental pillars that guide everything we build.

  • First, we successfully convinced leadership (all the way up to the board) that while AI is nice, having a solid data foundation and platform is even better. When you’re dealing with regulatory reporting or enabling better tactical and strategic business decisions, that foundation matters more than any flashy AI feature.
  • Second, we developed clear ground rules for when to use which technology. When we need stability and structured signals, we reach for traditional machine learning first. When we’re dealing with messy input data like customer reviews or unstructured text, we consider generative AI. 

High-risk scenarios involving financial crime, regulations, or customer care always get hybrid solutions with humans in the loop. Low-risk internal use cases? That’s where we let AI shine and can afford the occasional mistake.
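These ground rules are simple enough to write down as a routing function. The sketch below is only an illustration of the rules as described here: the `Approach` enum, the `HIGH_RISK_DOMAINS` labels, and the function signature are hypothetical and not Phenom's actual implementation.

```python
from enum import Enum


class Approach(Enum):
    TRADITIONAL_ML = "traditional ML"
    GENERATIVE_AI = "generative AI"
    HYBRID_HUMAN_IN_LOOP = "hybrid with humans in the loop"


# Hypothetical labels for the high-risk areas named in the text.
HIGH_RISK_DOMAINS = {"financial_crime", "regulation", "customer_care"}


def choose_approach(domain: str, input_is_unstructured: bool) -> Approach:
    """Pick a starting technology for a use case, following the ground rules."""
    # High-risk scenarios always get a hybrid solution with human review.
    if domain in HIGH_RISK_DOMAINS:
        return Approach.HYBRID_HUMAN_IN_LOOP
    # Messy, unstructured input is where generative AI is considered.
    if input_is_unstructured:
        return Approach.GENERATIVE_AI
    # Stable, structured signals default to traditional machine learning.
    return Approach.TRADITIONAL_ML


print(choose_approach("financial_crime", input_is_unstructured=True).value)
print(choose_approach("internal_reporting", input_is_unstructured=False).value)
```

Checking risk before input type is the design choice that matters: a high-risk use case gets human review even when the data looks like an ideal fit for a fully automated model.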

5 lessons we can learn from Sora: Hype vs reality

For a brief moment, Sora seemed like the future of AI video generation. Then, almost as quickly as it appeared, it quietly disappeared.

Sora’s rise and disappearance offer a rare glimpse into the practical realities of developing cutting-edge AI. For AI leaders, engineers, and decision-makers, it provides a real-world view of what it takes to build scalable, commercially viable AI products. 

These lessons are essential for anyone hoping to turn AI research into lasting impact (without losing their sanity along the way).


1. Compute costs can limit even the most advanced AI models

Sora pushed the boundaries of multimodal AI, generating high-quality video from simple text prompts. The results were impressive, showing what AI can do when it combines natural language understanding with visual synthesis. 

Behind the shiny demos, however, economics told a different story…

Video generation consumes far more computational resources than text or image generation. 

Each video requires multiple GPU passes, massive memory bandwidth, and precise rendering pipelines. Running Sora at scale required significant GPU infrastructure, which made operating costs extremely high.

For organizations investing in AI infrastructure, the lesson is clear:

If your AI model’s scalability relies on high compute costs, innovation alone will not guarantee success. Even the fanciest AI can’t survive on wishful thinking.


2. Viral AI products may not create lasting value

Sora captured immediate attention as a breakthrough in AI content generation, with early adoption surging thanks to curiosity and experimentation.

Engagement dropped quickly. Novelty does not equal necessity. 

While Sora impressed users with creative demos, it struggled to offer repeatable value for daily use. Tools integrated into professional workflows, such as AI copilots, automation platforms, or enterprise AI solutions, provide consistent value.

For product teams, the takeaway is straightforward: building viral demos is exciting, but retention drives long-term success. Products must solve recurring problems or integrate seamlessly into user workflows.

  • Build for retention, not just reach
  • Prioritize workflow integration over wow-factor

The most successful AI products balance novelty with practicality, offering value that users return to day after day. Think of it as the difference between a fleeting TikTok trend and a tool you actually rely on at work.


3. Monetization strategies must be clear from day one

Sora also highlighted the challenges of monetizing cutting-edge AI technology. Its positioning in the AI business model landscape was unclear:

  • Too expensive for mass free usage
  • Too entertainment-focused for enterprise budgets
  • Too early for a well-defined pricing strategy

While Sora generated excitement, companies struggled to find a path to revenue. The market rewards AI applications where ROI is measurable, including:

  • AI for productivity
  • AI for software development
  • AI for operational efficiency

These areas are experiencing accelerating enterprise AI adoption. Clear monetization strategies (subscription, usage-based, or enterprise licensing) turn AI innovation into sustainable products. In short: hype gets attention, but cash keeps the lights on.


4. Trust, IP, and governance are central concerns

Like many generative AI systems, Sora raised urgent questions about:

  • Copyright and intellectual property
  • Deepfake risks and synthetic media misuse
  • Ownership of AI-generated content

For companies deploying AI at scale, these issues are critical. Organizations must establish strong governance frameworks, compliance strategies, and ethical guidelines. 

Trust is a core part of product design. Users and enterprises expect AI outputs to be compliant. Addressing governance can improve adoption and reduce legal or operational risks. Think of governance as the seatbelt of AI: you might be able to drive without it, but do you really want to test that theory?

5. Focus and resource allocation determine AI winners

Sora demonstrates the importance of focus and strategic resource allocation. OpenAI shifted its resources from Sora toward higher-impact areas.

In a world of limited compute, talent, and capital, every AI initiative competes for attention and investment. Success is determined by strategic prioritization.

The most effective AI strategy is to focus on initiatives that scale.

This requires leadership teams to make careful choices, balancing short-term excitement with long-term impact. Scaling AI involves building products that deliver sustained value.


Conclusion: From hype to execution

Sora illustrates a broader shift in the AI landscape. We are moving from:

  • Experimental innovation to Scalable AI Systems
  • Eye-catching demos to Production-Grade AI Applications
  • Hype-driven narratives to ROI-Driven Decision-Making

The future of AI rewards teams that combine technical excellence with practical deployment. Successful AI products deliver consistent, measurable value while navigating the constraints of cost, infrastructure, and trust.

Sora shows that while hype opens doors, execution defines winners. Today’s AI professionals must focus on building products that actually work in the real world, and maybe have a little fun along the way…
