Commonplace: Self-hosted, privacy-tiered memory for your AI agents
Article URL: https://github.com/itsmeduncan/commonplace | Comments URL: https://news.ycombinator.com/item?id=48740235 | Points: 1 | Comments: 0
ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
Recursive Self-Evolving Agents via Held-Out Selection
arXiv:2606.28374v1 Announce Type: new Abstract: LLM agents are increasingly improved without weight updates by evolving a natural-language artifact, such as reflections, workflo...
GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes
arXiv:2606.28514v1 Announce Type: new Abstract: Multimodal models are increasingly deployed to solve tasks collaboratively with humans or other artificial agents. Existing bench...
IMCBench: A benchmark for multimodal LLMs in Image-grounded Medical Conversations
arXiv:2606.28556v1 Announce Type: new Abstract: Recent advances in large language models and vision-language models have enabled reasoning over multimodal data, offering opportu...
Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models
arXiv:2606.28524v1 Announce Type: new Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measure...
A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
arXiv:2606.28526v1 Announce Type: new Abstract: The clinical and communication skills of medical students are commonly assessed through Objective Structured Clinical Examination...
AnTenA: Actionable and Explainable Tensor Analysis System with Large Language Models
arXiv:2606.28708v1 Announce Type: new Abstract: Accurately explaining hidden patterns in multi-aspect data has typically been done by leveraging labels and/or accompanying auxil...
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement w...
Databricks brings GPT-5.5 to enterprise agent workflows
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
Introducing GPT-5.4 mini and nano
GPT-5.4 mini and nano are smaller, faster versions of GPT-5.4 optimized for coding, tool use, multimodal reasoning, and high-volume API and sub-agent workloads.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
Introducing the Gemini 2.5 Computer Use model
Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces.
Show HN: Mimir – local-first encrypted memory for AI agents (single Rust binary)
Article URL: https://github.com/Perseus-Computing-LLC/mimir | Comments URL: https://news.ycombinator.com/item?id=48739468 | Points: 1 | Comments: 2
Claude Code Skills: 98 AI architectures, Haiku at 93% of Fable 5 quality
Article URL: https://github.com/GPire/claude-skills-swarm | Comments URL: https://news.ycombinator.com/item?id=48740141 | Points: 1 | Comments: 0
Is it agentic enough? Benchmarking open models on your own tooling
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.
Strengthening our Frontier Safety Framework
We’re strengthening the Frontier Safety Framework (FSF) to help identify and mitigate severe risks from advanced AI models.
Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories
arXiv:2606.28589v1 Announce Type: new Abstract: Current approaches to enhance Large Language Model (LLM) reasoning, such as Chain-of-Thought and "Wait" prompts, primarily encour...
Aristotelian Virtue Profiling of LLMs through Ethical Dilemmas
arXiv:2606.28683v1 Announce Type: new Abstract: Large Language Models (LLMs) often face ethical tradeoffs in which several responses may be defensible but express different prio...