Cloudflare has confirmed a global service disruption affecting its R2 object storage platform and several dependent services. The outage lasted for 1 hour and 7 minutes, leading to complete write failures and partial read failures across the board.
Launched as a scalable, S3-compatible object storage solution, Cloudflare R2 offers features such as zero-cost egress (free data retrieval), multi-region replication, and tight integration with the rest of Cloudflare's platform. During the incident, however, the platform recorded a 100% write failure rate and a 35% read failure rate globally, significantly impacting customers that rely on R2 for their operations.
According to Cloudflare’s post-incident report, the outage occurred between 21:38 UTC and 22:45 UTC and was triggered by a critical error during a credential rotation process. In simple terms, the R2 Gateway — the API front-end handling requests — lost authentication access to its backend storage systems.
The root cause? A new set of credentials intended for production was accidentally deployed to a development environment, while the old production credentials were deleted. This left the production R2 Gateway without any valid credentials for its backend storage, causing the widespread failure.
The error came down to a single missing command-line flag: `--env production`. Without this parameter, the deployment system defaulted to the development environment, breaking the link between the production gateway and its backend storage.
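To illustrate the class of safeguard that prevents this kind of mistake, here is a minimal sketch of a deployment wrapper that refuses to run a credential change unless the target environment is stated explicitly. The wrapper, its flags, and the underlying wrangler-style command are assumptions for illustration, not Cloudflare's actual tooling.

```python
import argparse
import subprocess
import sys

# Environments the hypothetical wrapper is allowed to target.
ALLOWED_ENVS = ("development", "staging", "production")

def main() -> int:
    parser = argparse.ArgumentParser(
        description="Rotate a secret without falling back to a default environment"
    )
    parser.add_argument("secret_name", help="name of the secret to rotate")
    # required=True is the whole point: omitting --env aborts the deploy
    # instead of silently defaulting to the development environment.
    parser.add_argument("--env", required=True, choices=ALLOWED_ENVS,
                        help="target environment (must be explicit)")
    args = parser.parse_args()

    # Hypothetical underlying deploy command, shown only to illustrate the flow.
    cmd = ["wrangler", "secret", "put", args.secret_name, "--env", args.env]
    print(f"Deploying {args.secret_name} to {args.env!r}")
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(main())
```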
Cloudflare also shared that the misconfiguration was not immediately apparent: because credential deletions take time to propagate through its infrastructure, monitoring systems did not flag the problem right away.
“The decline in availability metrics was gradual. A delay in propagating the old credential’s deletion slowed our discovery of the problem,” Cloudflare explained. “Instead of relying solely on availability metrics, we should have verified which token was active in the R2 Gateway service.”
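Cloudflare's own takeaway, verifying which token the gateway is actually using before removing the old one, can be sketched as a simple post-rotation check. The health endpoint, response field, and timings below are assumptions for illustration, not Cloudflare's internal API.

```python
import json
import time
import urllib.request

# Hypothetical internal endpoint that reports which credential the R2 Gateway
# is currently authenticating with; the URL and field name are illustrative.
HEALTH_URL = "https://r2-gateway.internal/credential-status"

def active_credential_id() -> str:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        return json.load(resp)["active_credential_id"]

def confirm_rotation(new_credential_id: str, timeout_s: int = 300) -> None:
    """Block until the gateway reports the new credential, or fail loudly."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if active_credential_id() == new_credential_id:
            return  # Only now is it safe to delete the old credential.
        time.sleep(10)
    raise RuntimeError(
        "Gateway never reported the new credential; keep the old one in place."
    )
```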
Fortunately, the incident didn’t cause any customer data loss or corruption. However, it did trigger various levels of service degradation across Cloudflare’s product suite:
- R2 Storage: Full write failures (100%) and significant read failures (35%), though cached data remained accessible.
- Cache Reserve: Increased origin traffic due to the failed read attempts.
- Images and Stream Services: Uploads failed completely. Image delivery dropped to 25%, while Stream delivery fell to 94% of normal levels.
- Additional Impacts: Services like Email Security, Vectorize, Log Delivery, Billing, and Key Transparency Auditor also faced disruptions.
In response to this event, Cloudflare is rolling out several preventive measures. The company has enhanced credential logging and validation and mandated the use of automated deployment tools to minimize human errors during critical operations.
Cloudflare is also updating its Standard Operating Procedures (SOPs) to require dual validation for sensitive tasks such as credential rotations, and improving health checks so that failures of this kind are detected faster in the future.
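One plausible shape for such a health check is an end-to-end canary that writes and reads a small object on a tight interval, so a total write failure surfaces within seconds rather than as a slow decline in availability graphs. The bucket name, endpoint placeholder, and alerting hook below are assumptions; this is a sketch of the general technique, not Cloudflare's internal monitoring.

```python
import time
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# R2 is S3-compatible, so a standard S3 client works against an R2 endpoint.
# The endpoint and bucket here are placeholders for illustration.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
)

def canary_probe(bucket: str = "health-canary") -> bool:
    """Write then read a tiny object; return False on any failure."""
    key = f"canary-{int(time.time())}"
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=b"ok")
        s3.get_object(Bucket=bucket, Key=key)
        return True
    except (BotoCoreError, ClientError):
        return False

def run_canary(interval_s: int = 30, failure_threshold: int = 3) -> None:
    """Page after a few consecutive failures instead of waiting on metrics."""
    consecutive_failures = 0
    while True:
        consecutive_failures = 0 if canary_probe() else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            print("ALERT: R2 canary failing; possible write outage")  # placeholder alert
        time.sleep(interval_s)
```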
Interestingly, this wasn’t the first time Cloudflare’s R2 suffered an outage caused by human error. A similar hour-long disruption in February stemmed from an operator mistake while handling a phishing URL report. Instead of blocking the malicious endpoint, the operator unintentionally shut down the entire R2 Gateway service.
That incident exposed gaps in Cloudflare’s safeguards, particularly around high-risk actions lacking proper validation checks. Since then, Cloudflare has committed to tightening account provisioning processes, enforcing stricter access controls, and requiring two-party approvals for critical operations.
With these new changes, Cloudflare aims to bolster the reliability of its R2 service and prevent future disruptions caused by human mistakes.