It only took 18 characters to bring Salesforce down
All it takes is 18 characters to bring down your Salesforce org.
How the problems started
First, there was a Slack ping from my manager.
"It's not a Salesforce outage, but I need your help on an incident. I think something is wrong with one of our Salesforce features. Our integration team can't create Cases and it's a major outage for them." Then he invited me to a Slack channel.
We had been dealing with some performance and maintenance issues in Salesforce earlier in the week, so I thought those might be the cause. I was wrong.
How many Integration engineers does it take to triage Salesforce outages?
The details: ten engineers from another team were trying to figure out why their integration service wasn't creating Cases in Salesforce.
We asked them if it was possible to send us the fields they were setting when they tried to create a Case.
It was important to ask this so we could try to reproduce the issue.
"What's the status? Why is Salesforce down?"
I logged in to Salesforce and navigated to the Developer Console.
Then, I pasted the Apex equivalent of what the Integration User was using to create a Case into the "Execute Anonymous" window.
It was a quick-and-dirty approach, but it was the fastest way I could think of to reproduce the issue. Then I saw the error: "Invalid cross reference id: []"
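For illustration, an Execute Anonymous reproduction along these lines might look like the sketch below. The field API name Category__c and the Subject value are assumptions, not the actual payload the Integration team sent, and the script has to run as the user who sees the failure.

```apex
// Minimal reproduction sketch for the Execute Anonymous window.
// Category__c and the Subject value are hypothetical stand-ins.
Savepoint sp = Database.setSavepoint();
try {
    insert new Case(
        Subject = 'Repro: integration Case create',
        Category__c = 'Outreach'
    );
    System.debug('Case created without errors');
} catch (DmlException e) {
    // For this incident, the first failure should report an invalid cross reference.
    System.debug(LoggingLevel.ERROR,
        'DML failed: ' + e.getDmlType(0) + ' - ' + e.getDmlMessage(0));
} finally {
    // Roll back so the reproduction leaves no test Case behind.
    Database.rollback(sp);
}
```

The savepoint and rollback are a design choice for this sketch: they let you rerun the script with different Category values without cluttering the org.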
Great, I can reproduce the issue. I no longer needed the Integration team's service user to create Cases for me.
Then, I pulled the Apex logs. After some browsing, I didn't see any valuable information in Salesforce that stood out to me.
My manager, who was on a Zoom call with me while I was doing this, yelled "Debugging in Apex sucks!" I think I'd agree with him here.
I can reproduce the issue in Salesforce, but I don't see anything valuable in the logs. I'm confused.
The Integration service User creating the Cases has access to all records I am setting in my request.
So I tried to create the same Case as my own User. To my surprise, I could create the Case! I didn't get any errors.
As I realized this, other engineers asked if I could get on a Slack huddle. I joined and learned that they only get the error if they set the Case's "Category" field to specific values.
I learned that if I'm logged in as the Integration service User and I set the Category field to "Outreach" or "Follow-up," I get the error. Otherwise, the Case is created.
What could be causing this issue?!?
Getting to the root cause of the Salesforce outage
We spent time trying to identify the root cause.
Checking our previous deployments in Flosum
Alright - "what did we deploy yesterday?" So I go and check the history of our Salesforce deployments.
I open up our Flosum instance to audit everything we've deployed over the last 24 hours.
Hmm - all we deployed were some custom Label updates. How could that contribute to bringing our Salesforce instance down?
It doesn't seem like that caused the problem. How could labels be the culprit here?
So, I move to my next line of thinking of what could be causing these issues.
Note: I'm not sponsored by Flosum; I just think their DevOps product helps Salesforce administrators deploy things way faster than change sets. Here's a link to visit their website: https://flosum.com.
Let's take a look at our codebase
We fired up Visual Studio Code and searched where the Case's Category field showed up in our Salesforce codebase.
We got hits on the following places:
- Apex classes
- Apex triggers
- Flows
We found a suspicious Flow in Salesforce
Ultimately, we found the exact Flow that runs when we create the Case.
I searched the Flow metadata. Then, I discovered the Flow changes the Case Owner when the Category field is "Outreach."
The Flow sets the Case Owner value to a value in a Label named "System.Label.Outreach_Queue."
We're almost there
Then, we looked at the value in the Outreach_Queue Label. It's a Queue Id.
So, we used SOQL to look up the queue with that Id.
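A lookup like that can be sketched in Execute Anonymous as follows. This assumes the Label holds a well-formed Id and that queues are stored as Group records with Type 'Queue', which is the standard Salesforce model; the debug messages are illustrative only.

```apex
// Read the Id stored in the custom Label and check whether a queue
// with that Id exists in the org where this script runs.
String queueId = System.Label.Outreach_Queue;
List<Group> queues = [
    SELECT Id, Name, DeveloperName
    FROM Group
    WHERE Type = 'Queue' AND Id = :queueId
];
if (queues.isEmpty()) {
    System.debug('No queue found in this org for Id ' + queueId);
} else {
    System.debug('Label points at queue: ' + queues[0].Name);
}
```

An empty result in production for an Id that resolves in a sandbox is a strong hint that the value was copied from the wrong org.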
No results! We're onto something here. We have an idea of what is bringing Salesforce down.
What if the Queue Id is from a sandbox, not production?
Remember when I disqualified the Labels deployment as a potential root cause? Because there's no way Labels brought down our Salesforce support functionality - right?
Sure enough, we discovered that the Id in the Label was a value from a sandbox!
The Flow was setting the Case Owner to an Id that did not exist in production.
We updated the Label value to reflect the production Queue Id. Sure enough, everything worked!
Right as our Integration team was about to ping us about the Salesforce status, we notified them and gave them the details.
The Salesforce team identified the root cause and closed the page on this incident. Their services were back up and running.
But how could we have prevented this?
Run Flosum's "Overwrite Protection" feature
We currently use Flosum for our DevOps solution. There's one feature that I like about the product.
Flosum has a feature where you can do a real-time check between your feature branch and a target Salesforce instance.
It's called Overwrite Protection.
Our feature branch had all the labels. Our target Salesforce instance was our production instance.
How we could have prevented it: Our release manager didn't run Flosum's Overwrite Protection feature, so the deployment overwrote the Queue Id stored in the production Label. If we had run Overwrite Protection, we would have caught this inadvertent change.
Test your Salesforce Flows
There weren't any unit tests that covered a Case going through this process. If we had unit tested this scenario, we could have caught the error before our changes made it to production.
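As a rough sketch of what such a test could look like: the class name, method name, and the Category__c field API name below are all hypothetical, and the test assumes the Flow is active and that queue records are visible to test code in the org where it runs.

```apex
@IsTest
private class CaseOutreachRoutingTest {
    @IsTest
    static void outreachCaseIsOwnedByExistingQueue() {
        // Category__c is an assumed API name for the Category field.
        Case c = new Case(Subject = 'Routing test', Category__c = 'Outreach');

        Test.startTest();
        // If the Flow assigns an owner Id that does not exist in this org,
        // this insert throws a DmlException and the test fails.
        insert c;
        Test.stopTest();

        Case saved = [SELECT OwnerId FROM Case WHERE Id = :c.Id];
        Integer matches = [
            SELECT COUNT()
            FROM Group
            WHERE Type = 'Queue' AND Id = :saved.OwnerId
        ];
        System.assertEquals(1, matches,
            'Outreach Case should be owned by a queue that exists in this org');
    }
}
```

A test like this only catches a sandbox Id if it runs against the target org, for example as part of a validation deployment to production. In the sandbox the Id came from, it would pass.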
Debugging in the Cloud is fun, isn't it?