Resend – Incident report for February 21st, 2024

Summary (TL;DR)

On February 21st, 2024, Resend experienced an outage that affected all users due to a database migration that went wrong. This prevented users from using the API (including sending emails) and accessing the dashboard from 05:01 to 17:05 UTC (about 12 hours).

The database migration accidentally deleted data from production servers. We immediately began the restoration process from a backup, which completed 6 hours later. Unfortunately, as soon as it completed, we discovered that it had not restored the data, so we had to start the restoration process a second time with a different backup.

During this time, no API requests were being accepted and no data was being stored. For data created before the migration, there was 5 minutes of data loss, covering the period when the migration started and the database went offline, from 04:50:00 to 04:56:27 UTC. We are currently working on re-populating the data from this 5-minute window.

We sincerely apologize for the impact and inconvenience caused by this outage. We place great importance on reliability, but this week, we fell short of our commitment to you all. It is clear that we have a long way to go in becoming an industry-leading infrastructure provider, but in learning from this incident, we will improve our operations and tooling to avoid outages like this in the future, whatever the cause.

Timeline

All times are in Coordinated Universal Time (UTC)

February 21st, 2024

  • 04:56: Database migration started
  • 04:57: Noticed tables being dropped from the production database
  • 05:01: Began restoring the database from a backup
  • 05:02: Posted on status page, updating every 30-60 minutes until resolution
  • 11:02: First restoration process completed
  • 11:03: Realized the first restoration failed and began to investigate
  • 11:33: Discovered that the restoration failed due to a wrong selection of the backup timestamp
  • 11:48: Increased compute to speed up the restoration process – updated database memory from 128GB to 256GB and CPU from 32-core ARM to 64-core ARM
  • 12:05: Began restoring the database from an older backup
  • 17:01: Second restoration process completed
  • 17:02: API started receiving requests
  • 17:05: Dashboard was accessible again, and incident was resolved
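
The rounded durations quoted in this report (about 6 hours, about 5 hours, about 12 hours) can be checked against the timeline. The following is a small illustrative sketch; the helper name `elapsed` is ours and is not part of any Resend tooling.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps are taken from the timeline entries above.
first_restore = elapsed("05:01", "11:02")   # first restoration attempt
second_restore = elapsed("12:05", "17:01")  # second restoration attempt
total_outage = elapsed("05:01", "17:05")    # restore start to resolution

for label, value in [
    ("first restore", first_restore),
    ("second restore", second_restore),
    ("total outage", total_outage),
]:
    print(f"{label}: {value}")
```

This gives 6 hours 1 minute for the first restoration, 4 hours 56 minutes for the second, and 12 hours 4 minutes for the outage as a whole.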

What happened

While building a feature, we ran a database migration command locally, but it incorrectly pointed to the production environment instead, which dropped all tables in production.
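
One common safeguard against this class of mistake is a pre-flight check that refuses to run a migration when the connection string targets production. The sketch below is illustrative only and does not describe Resend's actual tooling: the host name in `PRODUCTION_HOSTS` and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical host name; a real project would load this from its own config.
PRODUCTION_HOSTS = {"db.prod.example.com"}


def is_production(database_url: str) -> bool:
    """Return True when the URL's host is in the known production set."""
    host = urlparse(database_url).hostname or ""
    return host in PRODUCTION_HOSTS


def assert_safe_to_migrate(database_url: str, allow_production: bool = False) -> None:
    """Raise before any migration runs against production without an explicit opt-in."""
    if is_production(database_url) and not allow_production:
        raise RuntimeError(
            "Refusing to run migration: target is a production database. "
            "Set allow_production=True only from a reviewed deploy pipeline."
        )
```

A migration script would call `assert_safe_to_migrate(url)` before opening a connection, so a local shell with a misconfigured environment variable fails loudly instead of modifying production.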

The first attempt to restore the database took 6 hours but failed due to a wrong selection of the backup timestamp. The second attempt to restore took an additional 5 hours and succeeded, bringing all data back except for a 5-minute window of data loss.
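
This report does not describe how the restore timestamp was chosen. As a general illustration, a restore point can be selected programmatically as the most recent backup taken strictly before the incident began. In the sketch below, the function name and the backup timestamps are hypothetical; only the 04:56 UTC migration start comes from the timeline.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def pick_restore_point(
    backups: Iterable[datetime], incident_start: datetime
) -> Optional[datetime]:
    """Return the most recent backup taken strictly before the incident began."""
    candidates = [b for b in backups if b < incident_start]
    return max(candidates, default=None)


# Migration start time from the timeline (04:56 UTC on February 21st, 2024).
incident_start = datetime(2024, 2, 21, 4, 56, tzinfo=timezone.utc)

# Hypothetical backup timestamps, for illustration only.
backups = [
    datetime(2024, 2, 20, 4, 0, tzinfo=timezone.utc),
    datetime(2024, 2, 21, 4, 0, tzinfo=timezone.utc),
    datetime(2024, 2, 21, 5, 0, tzinfo=timezone.utc),  # taken after the tables were dropped
]

restore_point = pick_restore_point(backups, incident_start)
print(restore_point.isoformat())
```

Restoring from a snapshot taken after the destructive change would reproduce the empty state, so the comparison against the incident start is the important part of the check.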

Impact

All users were unable to send emails, use the API, or access the Resend dashboard for 12 hours, from 05:01 to 17:05 UTC.

For data created before the migration, there was 5 minutes of data loss, covering the period when the migration started and the database went offline, from 04:50:00 to 04:56:27 UTC.

Next steps and improvements

  • Re-populate data from the 5-minute window of data loss.

  • No accessible user role should have write privileges on the production database.

  • Improve local development to reduce risks related to database migrations.

  • Build redundancy to keep the sending feature available even during a database outage.

  • Increase cadence for disaster recovery tests.

  • Implement an incident banner on the Resend dashboard to inform users quickly.

To our customers, we are deeply sorry that this incident occurred and that it prevented you from delivering your mission-critical emails. We know that actions speak louder than words, so we will continue to learn, grow, and improve, starting by implementing the improvements listed above.
