RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems
arXiv:2606.23927v1 Announce Type: new Abstract: Agentic AI systems powered by large language models (LLMs) are rapidly evolving into autonomous decision-making systems, exposing...
Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs
arXiv:2606.23938v1 Announce Type: new Abstract: Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representat...
Critique of Agent Model
arXiv:2606.23991v1 Announce Type: new Abstract: What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "A...
EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL
arXiv:2606.23693v1 Announce Type: new Abstract: Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have inc...
Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification
arXiv:2606.23881v1 Announce Type: new Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observ...
MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models
arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We intro...
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement w...
Databricks brings GPT-5.5 to enterprise agent workflows
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
Introducing GPT-5.4 mini and nano
GPT-5.4 mini and nano are smaller, faster versions of GPT-5.4 optimized for coding, tool use, multimodal reasoning, and high-volume API and sub-agent workloads.
Introducing computer use in Gemini 3.5 Flash
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
Introducing the Gemini 2.5 Computer Use model
Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces.
Is it agentic enough? Benchmarking open models on your own tooling
Same flaw, opposite verdict: what counts as a vulnerability in AI agents?
Article URL: https://medium.com/@nikrig/same-flaw-opposite-verdict-ai-agents-cant-agree-what-counts-as-a-security-vulnerability-995060e5b0a5 Comments URL: https://news.ycombinat...
A New Framework for Evaluating Voice Agents (EVA)
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.
Connect Your AI Agent to Google Sheets
Article URL: https://quickchat.ai/post/connect-ai-agent-to-google-sheets Comments URL: https://news.ycombinator.com/item?id=48665781
Mycelium – codebase memory for AI coding agents
Article URL: https://www.getmycelium.net/ Comments URL: https://news.ycombinator.com/item?id=48664937
Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control
arXiv:2606.24010v1 Announce Type: new Abstract: Multi-agent systems are widely used in safety-critical applications that require coordinated behavior under strict safety constra...
Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
arXiv:2606.24026v1 Announce Type: new Abstract: Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized co...