Intermittent: SP-initiated SAML response was received unexpectedly

Hi,

We are experiencing this error intermittently on our production server. Our SAML SSO has worked fine for years, but recently we have had three incidents where SSO failed with this error for a few hours and then started working again on its own. Two of these incidents occurred and recovered overnight, before we were able to start troubleshooting. On the other occasion, restarting the server appeared to resolve the issue. We have moved to different VMs between incidents, so we know it is not specific to the hardware.

I have read through the troubleshooting post on this forum regarding this error, but because our case is intermittent, I still don't understand how this error is occurring. Most of the causes in that post would produce a permanent problem rather than an intermittent one.

Do you have any suggestions on how this error would occur intermittently?

Thanks

Dan

Hi Dan,

For SP-initiated SSO, we maintain SAML session state and check this state when a SAML response is received.

By default, the SAML session state is maintained in memory and is indexed by a saml-session cookie.

If the cookie is missing or the session state it indexes is missing, we throw the exception you’re seeing.
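To make the mechanism above concrete, here is a minimal conceptual model in Python. It is not ComponentSpace's implementation. The class and method names are hypothetical, and only the cookie name saml-session and the error text come from this thread. It assumes the state is a per-process dictionary, so anything that discards the stored entry produces the same error as a missing cookie.

```python
import secrets


class SamlSessionStore:
    """Conceptual model, not ComponentSpace's code: SP-side SAML session state
    kept in a per-process dict and indexed by the saml-session cookie value."""

    COOKIE = "saml-session"

    def __init__(self):
        self._state = {}  # lives only as long as this process

    def begin_sp_initiated_sso(self, cookies):
        """Record state for an outgoing authn request, setting the cookie if absent."""
        if self.COOKIE not in cookies:
            cookies[self.COOKIE] = secrets.token_urlsafe(16)
        request_id = "_" + secrets.token_hex(16)
        self._state[cookies[self.COOKIE]] = {"authn_request_id": request_id}
        return request_id

    def receive_sp_initiated_response(self, cookies):
        """Find the state for an SP-initiated SAML response (one with InResponseTo)."""
        session_id = cookies.get(self.COOKIE)
        state = self._state.get(session_id) if session_id is not None else None
        if state is None:
            # Missing cookie and missing state raise the same error in this model.
            raise LookupError("SP-initiated SAML response was received unexpectedly")
        return state
```

In this model, a response that reaches a store which never recorded (or has since lost) the matching authn request fails in exactly the same way as one that arrives without the cookie.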

It’s hard to know the specific cause without more information.

If your application is deployed to multiple web servers, either configure sticky sessions at the load balancer or store the SAML session state in a central repository such as a database.
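The multi-server case can be sketched as a toy Python model under stated assumptions: each server's memory is a plain dict, the SP-initiated round trip is two requests (sending the authn request, then receiving the SAML response), and the hypothetical LoadBalancer either routes round-robin or pins a cookie value to one server by hashing it. Real load balancers usually implement affinity with their own cookie or a source-IP hash, so treat the hashing as illustrative only.

```python
import hashlib
import itertools


class LoadBalancer:
    """Hypothetical balancer; each entry in `servers` is one server's in-memory store."""

    def __init__(self, servers, sticky):
        self.servers = servers
        self.sticky = sticky
        self._next = itertools.cycle(servers)

    def pick(self, session_id):
        if self.sticky:
            # Affinity: a given cookie value always maps to the same server.
            digest = hashlib.sha256(session_id.encode()).digest()
            return self.servers[digest[0] % len(self.servers)]
        # No affinity: consecutive requests go to consecutive servers.
        return next(self._next)


def sp_initiated_sso(balancer, session_id):
    """Return True if the server receiving the SAML response has the session state."""
    balancer.pick(session_id)[session_id] = "pending authn request"  # request 1
    return session_id in balancer.pick(session_id)  # request 2: SAML response POST
```

Passing the same dict for every server models the central-repository option: routing no longer matters because every server reads the same state.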

Is there any pattern you can identify? For example, specific users or browsers?

Has anything changed?

If you can reproduce the issue, I suggest:

  • using the browser developer tools to capture the network traffic to see whether the HTTP Post of the SAML response includes the saml-session cookie
  • enabling SAML trace and sending the log file as an email attachment to support@componentspace.com, mentioning your forum post.
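A HAR file can also be checked with a short script rather than by eye. This Python sketch assumes the standard HAR 1.2 layout (log.entries[].request with method, url, headers, cookies and postData). Browsers differ in whether they fill in the cookies array or only the Cookie header, so it checks both. It reports every POST that carries a SAMLResponse form field and whether a saml-session cookie was sent with it. The function name and the script file name are hypothetical.

```python
import json
import sys
from urllib.parse import parse_qs


def saml_response_posts(har):
    """Yield (url, saml_session_cookie_sent) for each POST carrying a SAMLResponse."""
    for entry in har["log"]["entries"]:
        request = entry["request"]
        if request.get("method") != "POST":
            continue
        # Form fields may appear as parsed params, raw text, or both.
        post = request.get("postData", {})
        fields = {p["name"] for p in post.get("params", [])}
        fields |= set(parse_qs(post.get("text") or ""))
        if "SAMLResponse" not in fields:
            continue
        # Collect cookie names from the cookies array and any Cookie header.
        names = {c["name"] for c in request.get("cookies", [])}
        for header in request.get("headers", []):
            if header["name"].lower() == "cookie":
                names.update(p.split("=", 1)[0].strip() for p in header["value"].split(";"))
        yield request["url"], "saml-session" in names


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as har_file:
        for url, sent in saml_response_posts(json.load(har_file)):
            print(f"{url}: saml-session cookie {'sent' if sent else 'NOT sent'}")
```

Saved as, say, check_har.py, it would be run as `python check_har.py capture.har`. Comparing its output for a working and a failing capture shows at a glance whether the cookie is the difference.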

Thanks for the response.

We are using a single web server, so there should be no multi-server session state issues. We haven't identified any specific pattern, and it affects all SSO users regardless of which IdP they use. The problem seems to fix itself after a few hours or when the server is restarted. Nothing has changed in our SSO setup in years, and it worked flawlessly until now.

For now we have added an alert on the specific error so that we get woken up if it occurs overnight. That way we can try to gather more information if it happens again.

The fact that it resolves itself is what is perplexing!

Dan

Once the problem occurs are all SSO attempts for all users failing?

I agree that it’s very odd that the problem fixes itself after a few hours.

Let us know what you find.

Thanks.

Yes, all SSO attempts fail when the problem occurs. It could be a week or a few weeks before it occurs again, but I will certainly let you know when I have more details!

Thanks.

So the issue has occurred again. I was able to capture the network requests in a HAR file, but I couldn't switch on debug logging for the ComponentSpace namespace. Changing the logging settings requires recycling the application pool to reload them, and recycling the application pool seems to fix the problem.

When I compare the network requests from a successful SSO sign in to those of the one that failed, there appears to be no difference in presence or settings of saml-session cookies.

Still looking for ideas on what the problem could be. Would sending you the HAR file help?

Yes, please email the HAR file.

Sent!

Thanks for the HAR.

It includes an HTTP Post of a SAML response to your assertion consumer service endpoint. The SAML response includes an InResponseTo field so this is SP-initiated SSO.
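The InResponseTo check can also be done by hand on the SAMLResponse form value from a HAR. With the HTTP-POST binding the value is base64-encoded XML, and InResponseTo is an optional attribute on the samlp:Response element that IdP-initiated responses normally omit. This Python sketch only reads that attribute for diagnostics; it performs no signature or other validation, and the function name is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

SAMLP_NS = "urn:oasis:names:tc:SAML:2.0:protocol"


def in_response_to(saml_response_b64):
    """Return the InResponseTo attribute of a base64 SAML 2.0 Response, or None.

    Diagnostics only: no signature or other validation is performed.
    """
    root = ET.fromstring(base64.b64decode(saml_response_b64))
    if root.tag != f"{{{SAMLP_NS}}}Response":
        raise ValueError(f"not a SAML 2.0 Response element: {root.tag}")
    return root.get("InResponseTo")  # None when absent, e.g. IdP-initiated SSO
```

If the value was copied from postData.text in the HAR, it is still form-encoded, so pass it through urllib.parse.unquote_plus before decoding.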

The HAR doesn’t capture the earlier flow of the SAML authn request being sent to the IdP but I assume this occurred.

The HTTP Post includes a saml-session cookie so a missing cookie isn’t the issue.

The issue must have something to do with the session store. By default, session state is stored in the IDistributedCache. The default implementation uses a memory cache.

Does your application specify a different IDistributedCache implementation?

Could there be something within the application that’s clearing this cache?
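One way the question above could play out is sketched here as a simplified Python model. It assumes the SAML session entries share a size-limited in-process cache with other application data. Under that assumption, a burst of other entries can evict them between the authn request and the response, and an explicit clear removes them outright. Either way the lookup finds nothing and SSO fails even though the cookie is present. The BoundedMemoryCache class is hypothetical and uses plain LRU eviction. .NET's memory caches do not evict this way, so this illustrates the failure mode only, not what your application actually does.

```python
from collections import OrderedDict


class BoundedMemoryCache:
    """Hypothetical shared in-process cache that evicts least-recently-used entries."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._items = OrderedDict()

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]

    def clear(self):
        self._items.clear()
```

For example, with `max_entries=100`, writing 100 unrelated entries after a SAML session entry is stored is enough to push that entry out before the SAML response arrives.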

It’s a very odd issue. Just to summarize to ensure I understand correctly:

  • everything had been working without issues for a number of years
  • recently this issue has started, but it’s intermittent
  • once the issue occurs all SSO attempts fail for all users to all identity providers
  • the issue either resolves itself after a few hours or is resolved by a server restart

Is that correct?

I think the next step is to enable SAML trace. I recommend using something like Serilog with a log file that rolls over daily. Please email us the log file capturing the issue. It would be good if it captured several failed SSO attempts.