We are experiencing this error intermittently on our production server. Our SAML SSO has worked fine for years, but recently we have had 3 incidents where SSO failed with this error for a few hours and then started working again by itself. Two of these incidents occurred and recovered overnight, before we were able to start troubleshooting. In the third, restarting the server appeared to resolve the issue. We changed VMs between incidents, so we know it is not specific to the hardware.
I have read through the troubleshooting post on this forum regarding this error but, due to the intermittent nature of our case, I have yet to understand how this error is occurring. Most of the causes in the post would produce a permanent problem rather than an intermittent one.
Do you have any suggestions on how this error would occur intermittently?
For SP-initiated SSO, we maintain SAML session state and check this state when a SAML response is received.
By default, the SAML session state is maintained in memory and is indexed by a saml-session cookie.
If the cookie is missing or the session state it indexes is missing, we throw the exception you’re seeing.
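To make the two failure modes concrete, here is a minimal sketch in Python of an in-memory store indexed by the saml-session cookie. It is only an illustration of the behaviour described above, not the product's actual implementation: the class, method, and exception names are all hypothetical.

```python
import secrets

COOKIE_NAME = "saml-session"  # cookie name as mentioned in this thread


class SamlSessionError(Exception):
    """Raised when a SAML response cannot be matched to pending SSO state."""


class InMemorySamlSessionStore:
    """Hypothetical stand-in for in-memory SAML session state."""

    def __init__(self):
        self._states = {}  # session id -> pending SSO state

    def begin_sso(self, relay_state):
        """Record state when the authn request is sent; returns the cookie value to set."""
        session_id = secrets.token_urlsafe(16)
        self._states[session_id] = {"relay_state": relay_state}
        return session_id

    def complete_sso(self, cookies):
        """Look up state when the SAML response arrives; cookies is a dict of request cookies."""
        session_id = cookies.get(COOKIE_NAME)
        if session_id is None:
            # Failure mode 1: the browser did not send the cookie back.
            raise SamlSessionError("saml-session cookie missing from request")
        state = self._states.pop(session_id, None)
        if state is None:
            # Failure mode 2: the cookie is present but the state it indexes is gone.
            raise SamlSessionError("no session state for saml-session cookie")
        return state

    def clear(self):
        """Simulate losing in-memory state, for example after a process restart."""
        self._states.clear()
```

In this sketch, a request that arrives without the cookie and a request whose state has been cleared both end in the same exception, which is why the error message alone does not tell you which of the two happened.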
It’s hard to know the specific cause without more information.
If your application is deployed to multiple web servers, either configure sticky sessions at the load balancer or store the SAML session state in a central repository such as a database.
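As a rough sketch of the central repository option, the Python below keeps the session state in a database table so that any web server can look it up. Everything here is an assumption made for illustration: the table name, the schema, the class, and the use of SQLite as a stand-in for a shared database. It is not the product's own storage API.

```python
import json
import sqlite3


class DatabaseSamlSessionStore:
    """Hypothetical store that keeps SAML session state in a shared table."""

    def __init__(self, connection):
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS saml_session ("
            "id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self._db.commit()

    def save(self, session_id, state):
        """Persist the state under the value of the saml-session cookie."""
        self._db.execute(
            "INSERT OR REPLACE INTO saml_session (id, state) VALUES (?, ?)",
            (session_id, json.dumps(state)),
        )
        self._db.commit()

    def load(self, session_id):
        """Return the stored state, or None if nothing is stored for this id."""
        row = self._db.execute(
            "SELECT state FROM saml_session WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Because the state lives outside any one web server's memory, the server that receives the SAML response does not need to be the one that sent the request.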
Is there any pattern you can identify? For example, specific users or browsers?
Has anything changed?
If you can reproduce the issue, I suggest:
using the browser developer tools to capture the network traffic to see whether the HTTP POST of the SAML response includes the saml-session cookie
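If it is easier to check a saved capture than to inspect it live, the following Python sketch scans a HAR export from the browser developer tools and reports whether each POST carrying a SAMLResponse form field also sent the saml-session cookie. The function name and the file name in the usage note are hypothetical, and it assumes the capture follows the standard HAR layout (log, entries, request, cookies, postData).

```python
import json

COOKIE_NAME = "saml-session"  # cookie name as mentioned in this thread


def find_saml_posts(har):
    """Return a list of (url, has_cookie) for each POST that carries a SAMLResponse field."""
    results = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        if request.get("method") != "POST":
            continue
        post_data = request.get("postData") or {}
        params = post_data.get("params") or []
        body_text = post_data.get("text") or ""
        # The form field may appear as a parsed param or only in the raw body text.
        carries_response = (
            any(p.get("name") == "SAMLResponse" for p in params)
            or "SAMLResponse=" in body_text
        )
        if not carries_response:
            continue
        has_cookie = any(
            c.get("name") == COOKIE_NAME for c in request.get("cookies") or []
        )
        results.append((request.get("url"), has_cookie))
    return results
```

To use it, load the capture with something like json.load(open("capture.har")) and pass the result to find_saml_posts. Note that some browsers offer a sanitized HAR export that strips cookies, in which case this check would wrongly report the cookie as missing.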
We are using a single web server, so there should be no session state issues. There is no specific pattern we have identified and it affects all SSO users no matter which IDP. The problem seems to fix itself after a few hours or by restarting the server. Nothing has changed in our SSO in years and it has worked flawlessly until now.
For now we have added an alert on the specific error, so that we get woken up if it occurs overnight. That way we can try and gather more information if it happens again.
The fact that it resolves itself is what is perplexing!
Yes, all SSO attempts fail when the problem occurs. It could be a week or a few before it occurs again, but I will certainly let you know when I have more details!