June 4, 2026
Security is front and center this week. Anthropic published a detailed breakdown of how it limits agent blast radius across its products, with a striking finding: users approved 93% of permission prompts, making human oversight unreliable at scale. Separately, GPT-5.5 exploited a common Firebase misconfiguration in a deliberately vulnerable app 70% of the time in a controlled benchmark, while most other models scored zero. On the tooling side, AnythingLLM v1.13.0 ships automatic model routing between local and cloud models, a practical cost-cutting option for teams running mixed inference setups.
A security researcher built a deliberately vulnerable mobile app and ran nine LLMs against it to see which could exploit a common Firebase misconfiguration. GPT-5.5 solved it 70% of the time; most others scored zero.
Anthropic's engineering team details the containment strategies built for claude.ai, Claude Code, and Cowork, explaining why sandboxes and access boundaries beat human-in-the-loop supervision. One key finding: users approved 93% of permission prompts, making oversight unreliable over time.
GPT-Rosalind now brings enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and experimental workflow support to life sciences teams. Builders working in drug discovery or genomics research have new capabilities to integrate today.
OpenCompass 0.5.2 ships support for 14 new benchmarks spanning science, math, and instruction-following, plus new model and API integrations. Here is what product engineers need to know to update their evaluation pipelines.
AnythingLLM v1.13.0 ships a Model Router that automatically sends each message to the right model, mixing local and cloud AI in a single conversation. Builders can now cut API costs without giving up quality on complex tasks.