Show HN: An AI agent that applies to jobs for me (Playwright, GPT-5.4 form filling)
Article URL: https://github.com/torontodeveloper/job-application-agent Comments URL: https://news.ycombinator.com/item?id=48752969 Points: 2 # Comments: 3
Show HN: CLI that helps AI agents avoid vulnerable dependencies
deptrust is a CLI that checks package versions for known vulnerabilities across npm, PyPI, crates.io, Go modules, RubyGems, NuGet, Maven, Packagist, pub.dev, CocoaPods, Hex.pm,...
What Drives Interactive Improvement from Feedback?
arXiv:2606.30774v1 Announce Type: new Abstract: We study when natural-language feedback produces improvement beyond the gains obtainable from repeated attempts alone. In multi-t...
Contrastive Reflection for Iterative Prompt Optimization
arXiv:2606.30840v1 Announce Type: new Abstract: LLM agents are becoming central to information retrieval: they issue retrieval queries, synthesize answers, and increasingly serv...
Beyond expert users: agents should help users construct preferences, not just elicit them
arXiv:2606.30863v1 Announce Type: new Abstract: Agents typically assume an expert user -- one with well-formed preferences about what they want -- and default to clarifying ques...
When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs
arXiv:2606.30814v1 Announce Type: new Abstract: Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibrati...
Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer
arXiv:2606.30943v1 Announce Type: new Abstract: Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research r...
Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies
arXiv:2606.31039v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such...
ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement w...
Databricks brings GPT-5.5 to enterprise agent workflows
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
Introducing GPT-5.4 mini and nano
GPT-5.4 mini and nano are smaller, faster versions of GPT-5.4 optimized for coding, tool use, multimodal reasoning, and high-volume API and sub-agent workloads.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
Introducing the Gemini 2.5 Computer Use model
Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces.
Is it agentic enough? Benchmarking open models on your own tooling
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.
Strengthening our Frontier Safety Framework
We’re strengthening the Frontier Safety Framework (FSF) to help identify and mitigate severe risks from advanced AI models.
Show HN: Simulate what AI agents do to an engineering org (no signup)
Article URL: https://www.orgonaut.co/tools/agentic-reorg-simulator/ Comments URL: https://news.ycombinator.com/item?id=48753823 Points: 2 # Comments: 0
Investigating Multi-Agent Deliberation in Law
arXiv:2606.30906v1 Announce Type: new Abstract: Artificial Intelligence is increasingly applied to the field of law, and has the potential to increase access to justice. One par...
Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering
arXiv:2606.30911v1 Announce Type: new Abstract: ML engineering agents waste compute rediscovering known techniques because every competition is a cold start. We present HASTE, a...