Longer than usual processing times
Incident Report for AssemblyAI
Postmortem

We wanted to reach back out to share more detailed information on the incidents that occurred on 8/7 and 8/8. These incidents were caused by separate issues. See the information below for a description of each issue and the steps taken to remedy them.

8/7
Incident Cause
An inefficient database usage pattern was introduced in a change deployed on 8/2. Under the standard load at the time of deployment, the inefficiency caused no detectable regression. On 8/7 we encountered a new peak load which, combined with this inefficiency, led to the large increase in latency (turnaround times) that constituted this incident.
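The report does not say what the inefficient usage pattern was, so the following is a purely hypothetical sketch of one common class of such regression: an N+1 query loop that looks harmless at standard load but multiplies database round trips at peak, versus a single batched query. Table and column names are invented; SQLite stands in for the production database.

```python
# Hypothetical illustration only -- the actual regression is not described
# in the report. Shows an N+1 pattern whose cost grows with load, next to
# a batched equivalent that costs one query regardless of load.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("queued",)] * 1000)

def statuses_n_plus_one(ids):
    # One round trip per id: fine for 10 ids, painful for 10,000.
    return [
        conn.execute("SELECT status FROM jobs WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    ]

def statuses_batched(ids):
    # One round trip total, regardless of how many ids are requested.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, status FROM jobs WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    return [status for _, status in rows]
```

The point of the sketch is why no regression was detected before 8/7: both functions return the same result, and at pre-deployment load the per-request overhead of the first one stays within normal variance.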

Resolution
We identified and reverted the database usage change committed on 8/2 that led to this slowdown.
We upgraded our database instance size.

8/8
Incident Cause
A full table query was run against our write replica database while a team was transferring data to BigQuery for business intelligence tooling. This query caused database contention and slowed down our production service.
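As a hedged illustration of why this class of query is risky, and of a safer alternative, the sketch below contrasts a single full-table read with a keyset-paginated export that fetches the same data in short, bounded queries. Each chunk holds database resources only briefly, which limits contention with production traffic. The schema and chunking parameters are invented; SQLite stands in for the real database, and the BigQuery hand-off is only indicated in a comment.

```python
# Hypothetical sketch, not AssemblyAI's actual export pipeline: keyset
# pagination on the primary key keeps every statement small and bounded,
# unlike one long-running full-table scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"evt-{i}",) for i in range(10_000)],
)

def export_in_chunks(chunk_size=1000):
    """Yield the table in short, bounded queries (keyset pagination)."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows              # e.g. hand each chunk to a BigQuery loader
        last_id = rows[-1][0]   # resume after the last id seen

total_exported = sum(len(chunk) for chunk in export_in_chunks())
```

Keyset pagination (filtering on `id > last_seen`) is preferred here over `OFFSET`-based paging because each query stays cheap even deep into the table.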

Resolution
We implemented more fine-grained database access controls and roles, along with an approval process to verify that production database queries are run against the correct replica and will not impact customers.
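The report does not describe how these controls are enforced; the following is a minimal, hypothetical sketch of one such guardrail: a pre-flight check that refuses to run ad-hoc analytics queries against any host that is not an approved analytics replica. The host names and allowlist are invented for illustration.

```python
# Hypothetical guardrail sketch -- not AssemblyAI's actual tooling.
# Ad-hoc analytics queries must target an approved analytics replica,
# never a host serving production traffic.
ANALYTICS_REPLICAS = {"replica-analytics-1", "replica-analytics-2"}

def assert_safe_target(host: str) -> None:
    """Raise if an ad-hoc query would run against a non-analytics host."""
    if host not in ANALYTICS_REPLICAS:
        raise PermissionError(
            f"{host} is not an approved analytics replica; "
            "request access through the database approval process"
        )
```

In practice the same policy is usually enforced at the database layer as well (per-role grants limiting which users can connect to which instances), with a check like this acting as an early, friendlier failure in the tooling.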

If you have any questions about this information feel free to reach out to support@assemblyai.com.

Posted Aug 20, 2023 - 23:32 UTC

Resolved
Processing times have now returned to the normal range. We will continue to monitor traffic to ensure good performance going forward but at this point, this degradation of service is being marked as resolved.
Posted Aug 07, 2023 - 23:00 UTC
Monitoring
We are continuing to monitor the situation but we are seeing processing times go down as a result of the latest changes we deployed. While we are not quite back to the normal processing range yet, we are moving in that direction. We will continue to monitor processing times and update you again once we have fully returned to the normal range.
Posted Aug 07, 2023 - 22:36 UTC
Identified
The earlier changes we implemented did not improve processing times in the way that we hoped. We have another potential fix for this issue that we are working to release into Production. We will share another update here once that fix is live and we have had time to monitor its impact.
Posted Aug 07, 2023 - 22:03 UTC
Monitoring
We have made some changes to address the slower-than-normal processing times we have been seeing and are currently monitoring the results of those changes to measure improvement.
Posted Aug 07, 2023 - 21:17 UTC
Update
We are still working to identify the root cause of the slowdown but it looks to be related to database load, which is causing degraded performance and slower-than-expected processing times. Jobs are still completing but with longer than normal turnaround times.
Posted Aug 07, 2023 - 20:41 UTC
Update
We have seen some improvement in processing times, but we are still operating at higher-than-usual turnaround times. We are continuing to identify the root cause of the issue and will provide further updates as we learn more.
Posted Aug 07, 2023 - 20:22 UTC
Investigating
We are currently seeing slower-than-normal turnaround times for our Async API. Our Engineering team is actively investigating the issue and we will share additional information here as it becomes available.
Posted Aug 07, 2023 - 19:58 UTC
This incident affected: APIs (Asynchronous API).