Auto-FL-Research: Agentic Search for Federated Learning Algorithms
arXiv:2607.01366v1 Announce Type: new Abstract: Federated learning (FL) research often depends on many small but consequential algorithmic choices: optimizer variants, server ag...
Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases
arXiv:2607.01425v1 Announce Type: new Abstract: Understanding large, complex codebases, especially those with obfuscated structures and incomplete documentation, remains a signi...
Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows
arXiv:2607.01465v1 Announce Type: new Abstract: Large language models are trained to predict the next token, not to act inside a specific API. In niche enterprise SaaS workflows...
RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules
arXiv:2607.01293v1 Announce Type: new Abstract: We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text c...
RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation
arXiv:2607.01388v1 Announce Type: new Abstract: Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning step...
FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning
arXiv:2607.01440v1 Announce Type: new Abstract: Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evid...
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement w...
Databricks brings GPT-5.5 to enterprise agent workflows
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
Introducing GPT-5.4 mini and nano
GPT-5.4 mini and nano are smaller, faster versions of GPT-5.4 optimized for coding, tool use, multimodal reasoning, and high-volume API and sub-agent workloads.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
Introducing the Gemini 2.5 Computer Use model
Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces.
OpenCode, Pi, and Goose: Three Layers of the AI Agent Stack
Article URL: https://gist.github.com/AIMOWAY/bd8007c8f834a9bc83c71e3178239d75 Comments URL: https://news.ycombinator.com/item?id=48779685
Is it agentic enough? Benchmarking open models on your own tooling
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.
Strengthening our Frontier Safety Framework
We’re strengthening the Frontier Safety Framework (FSF) to help identify and mitigate severe risks from advanced AI models.
The Termi Protocol: Watch AI Coding Agents Build in 3D
Article URL: https://termiprotocol.com/ Comments URL: https://news.ycombinator.com/item?id=48780405
Show HN: Durable AI agents without the workflow engine
Article URL: https://www.noworkflows.dev/ Comments URL: https://news.ycombinator.com/item?id=48780400
Procedural Memory Distillation: Online Reflection for Self-Improving Language Models
arXiv:2607.01480v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each r...
Janus: a Playground for User-Involved Agentic Permission Management
arXiv:2607.01510v1 Announce Type: new Abstract: AI agents that autonomously execute tool calls on a user's behalf raise pressing questions about permission management: what role...