Tuesday, December 31, 2024

Azure AppGW on-demand probe success and unhealthy backend

Recently, while working on an AppGW issue, I observed some odd behavior: the on-demand health probe succeeded with HTTP response 200, while the actual backend health status for the same backend pool was unhealthy with a generic backend server reachability error.

We checked the service status and port on the backend server, as well as the NSG rules and routing, and found everything configured correctly.

Later, while we were working with Azure support, the engineer informed us about a known issue**: if the backend service supports only TLS 1.3, the on-demand probe can report success while the actual backend health stays unhealthy. This happens because AppGW currently supports only TLS 1.0, 1.1 and 1.2 when connecting to the backend, so a backend that accepts only TLS 1.3 fails the TLS handshake, which causes the health probe to fail and the backend to be marked unhealthy. Once the application team changed the supported TLS version from TLSv1.3 to TLSv1.2, the issue was resolved.

The frustrating part was that the generic backend server reachability error gave no indication that the problem could be related to an unsupported TLS version, and that no health probe logs were available.

Now that we know about this bug and the TLS limitation, testing the TLS support of the backend service should be part of the troubleshooting whenever you run into such an issue. It can be verified using the good old "openssl" or "curl" command line tools.

Assuming that your internal URL already has the required DNS mapping in place:

#openssl s_client -connect <your-internal-domain.com>:443 -tlsv1_2

or

#curl -v https://<your-internal-domain.com> --tlsv1.2 --tls-max 1.2

If you don't have the required internal DNS configuration for your site, either create a local hosts file entry or add the --resolve switch to the curl command.

#curl -v https://<your-internal-domain.com> --tls-max 1.2 --resolve <your-internal-domain.com>:443:<backend server IP>
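If you prefer to script the same check, here is a minimal Python sketch using only the standard library. It caps the client at TLS 1.2, which is roughly what the AppGW side is described as doing above, and reports what the backend negotiates. The function names and the optional `ip` parameter (playing the role of curl's --resolve) are my own illustration and not part of any Azure tooling.

```python
import socket
import ssl


def make_capped_context(max_version=ssl.TLSVersion.TLSv1_2):
    """Build a client context that refuses to negotiate above max_version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Certificate checks are disabled because only the protocol version is
    # being tested here; do not reuse this context for real traffic.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.maximum_version = max_version
    return ctx


def negotiated_tls_version(host, port=443, ip=None, timeout=5.0):
    """Return the negotiated TLS version string, or None on any failure.

    host is still sent as SNI; ip (hypothetical helper parameter) lets you
    connect straight to the backend server when DNS is not in place.
    """
    ctx = make_capped_context()
    target = ip or host
    try:
        with socket.create_connection((target, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version()
    except OSError:
        # Covers refused connections, timeouts and TLS handshake failures
        # (ssl.SSLError is a subclass of OSError).
        return None
```

A result of None from a backend that is otherwise reachable on the port suggests it may not accept TLS 1.2 or lower, which matches the situation described in this post. Note that the sketch does not distinguish a handshake failure from a plain connectivity problem, so confirm reachability first.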

**The Azure internal product team is already aware of this bug and is actively working on it. Unfortunately, they have not shared a specific ETA at the moment.

Reference: AppGW SSL related limitations; Troubleshoot backend health issues in Application Gateway

Hope this will help...thanks 😊

Sunday, June 30, 2024

Azure backup | Staging storage account not visible during restore

Recently, I got a query from a team member who ran into an issue while working on a VM restore task using Azure Backup. He was unable to find any storage accounts available for the staging location selection, despite there being a storage account in the subscription.

I had come across this issue earlier and was aware of the cause, but couldn't recall it at that moment, so I spent some time looking into the documentation. Once I found it, I thought of making a note, as who knows about the next time😉or let's hope Microsoft will remove this limitation😊


You can see the same behavior in the following related screenshots,

As you can see, I have three storage accounts in my test subscription: one is in another region, and of the other two, one has the Standard_ZRS SKU.


During restore, as you can see, the storage account with the ZRS SKU is not available for selection.
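To predict which accounts will show up before opening the restore blade, the two conditions observed above (same region, not ZRS) can be sketched as a small filter. This is only an illustration of what I saw in my test subscription, not the official eligibility logic. It assumes account records shaped like the JSON from `az storage account list` (with `name`, `location` and `sku.name`), and all account names below are made up.

```python
def eligible_staging_accounts(accounts, restore_region):
    """Return names of accounts expected to be selectable as staging location.

    Assumption based on this post only: the account must be in the same
    region as the restore and must not use a *_ZRS SKU.
    """
    eligible = []
    for account in accounts:
        same_region = account.get("location", "").lower() == restore_region.lower()
        sku_name = account.get("sku", {}).get("name", "")
        is_zrs = sku_name.upper().endswith("_ZRS")
        if same_region and not is_zrs:
            eligible.append(account["name"])
    return eligible


# Hypothetical sample mirroring the three accounts from the screenshots.
sample_accounts = [
    {"name": "stagingotherregion", "location": "westeurope", "sku": {"name": "Standard_LRS"}},
    {"name": "stagingzrs", "location": "eastus", "sku": {"name": "Standard_ZRS"}},
    {"name": "staginglrs", "location": "eastus", "sku": {"name": "Standard_LRS"}},
]

print(eligible_staging_accounts(sample_accounts, "eastus"))
```

With the sample data, only the LRS account in the restore region is returned, which is the same result the portal showed me.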



BTW if you click on the info icon in front of the "Staging Location" option, it will show you the related hint as follows😄

Reference: Azure Backup Restore documentation.

That's all for today, thanks...

Sunday, June 23, 2024

Azure AKS cluster upgrade error "... vmss has reached its limit of 10 models ..."

In this short post, we will discuss a terminal provisioning state "Failed" error encountered during an AKS cluster upgrade. The control plane was upgraded successfully, but the AKS node pool upgrade then failed with an error stating that the vmss has reached its limit of 10 models.


As the error shows, it pertains to the VM scale set model, so let's discuss that first. Essentially, the VMSS model represents the desired state of the scale set as a whole: it covers the properties of the scale set that affect its VMs, for example the VM size, the OS version or an extension.

For AKS node pools, the orchestration mode is set to Uniform (designed for a collection of similarly configured virtual machines), and the VMSS has the default model upgrade policy, Manual. With this policy, VMs can be on different models, but the scale set is restricted to 10 models overall (that is, the VMs in the VMSS can have at most 10 unique configurations).

So, you might see this error if you have an AKS node pool with 10+ nodes that are not on the latest model. You can check this by going to the respective node pool VM Scale Set => Instances and looking at the Latest Model column, which shows Yes/No.

To proceed from here, select the instances showing "No" in the Latest Model column and click "Upgrade" in the options at the top.

As this might reboot the instance and impact availability, make sure you do it with caution.

Once you have upgraded the instances to the latest model, re-initiate the AKS upgrade; this time you should not see this error.
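If the node pool is large, the same Latest Model check can be done outside the portal. The sketch below assumes you have saved the JSON output of `az vmss list-instances` for the node pool scale set, and that each instance record carries `instanceId` and `latestModelApplied` fields, as the CLI output did in my experience. The function name and the sample data are only illustrative.

```python
import json


def instances_needing_upgrade(instances):
    """Return the instance IDs that are not on the latest VMSS model."""
    return [
        vm.get("instanceId")
        for vm in instances
        if vm.get("latestModelApplied") is False
    ]


# Hypothetical sample in the shape of `az vmss list-instances` output.
sample_json = """
[
  {"instanceId": "0", "latestModelApplied": true},
  {"instanceId": "3", "latestModelApplied": false},
  {"instanceId": "7", "latestModelApplied": false}
]
"""

print(instances_needing_upgrade(json.loads(sample_json)))
```

The returned IDs are the instances you would select for the "Upgrade" action described above (the CLI counterpart, as far as I know, is `az vmss update-instances`). The same caution about possible reboots applies.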

If you're interested in learning more about scale set upgrade policies, check the related Virtual Machine Scale Sets documentation.

I hope you find this information useful. Thank you!

Saturday, June 22, 2024

Azure AppGW http error 403 and outdated browser

This post is about encountering an AppGW HTTP error 403 while attempting to access a site published through Azure Application Gateway v2 with WAF enabled.

Upon reviewing the AppGW logs (Category == ApplicationGatewayFirewallLog), we noticed that the error was specific to a single client IP and that the error message details mentioned browser cookies. Based on this, we recommended that the user try accessing the site with a different browser (previously Google Chrome), and the site was accessible.

Following that, we updated Google Chrome to the latest version and rechecked. This time we encountered no errors, and the site was accessible without any issues.

The main takeaway here is that outdated browsers can sometimes trigger the web application firewall to block incoming requests. Therefore, if you come across a similar indication in the AppGW logs, check the URL using a different browser, check the version of the browser in question, and update it to the latest.

Related sample KQL query to use for AppGW logs:

AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "ApplicationGatewayFirewallLog"
| where clientIp_s == "<required sourceIP>" and requestUri_s contains "/path in your case"


In my case, extract from query output,

Message: Detects MySQL comment-/space-obfuscated injections and backtick termination

OWASP CRS ruleSetVersion_s: 3.2

ruleGroup_s: REQUEST-942-APPLICATION-ATTACK-SQLI

details_message_s: Pattern match (?i:(?:(?:(?:(?:trunc|cre|upd)at|renam)e|(?:inser|selec)t|de(?:lete|sc)|alter|load)\s*?\(\s*?space\s*?\(|,.*?[)\da-f"'`]["'`](?:["'`].*?["'`]|(?:\r?\n)?\z|[^"'`]+)|\Wselect.+\W*?from)) at REQUEST_COOKIES.

details_data_s: {,"campaigns":{"34645675687werwe4567rit6":{ found within [REQUEST_COOKIES:ORA_PERS:{"ids":["-23434645757657"],"campaigns":{"":{"activeBlocks":["c1","C2","C3","C4"],"pointer":"E1","event":"-687897890978860392"}}}]}
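When there are many such entries, it helps to pull out which part of the request the rule matched on, since a value like REQUEST_COOKIES is the hint that points at the browser rather than the application. Here is a rough Python sketch for exported log rows. It assumes details_message_s ends with "at <LOCATION>." as in the extract above; the regular expression and function name are my own and may not cover every message format.

```python
import re

# Matches a trailing "at REQUEST_COOKIES." or "at ARGS:name." style suffix.
_MATCH_LOCATION = re.compile(r"\bat ([A-Z_]+)(?::[^\s.]+)?\.?\s*$")


def matched_location(details_message):
    """Return the request part named at the end of the message, or None."""
    found = _MATCH_LOCATION.search(details_message)
    return found.group(1) if found else None


# Shortened, hypothetical messages in the same shape as the extract above.
print(matched_location("Pattern match (?i:select.+from) at REQUEST_COOKIES."))
print(matched_location("Pattern match (?i:select.+from) at ARGS:id."))
```

Counting the returned values per client IP gives a quick view of whether the blocks are cookie related, which was the case here.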


If you're interested in learning more about HTTP error codes, you can explore the following links:

HTTP response status codes

HTTP response codes in Application Gateway


I hope you find this information useful. Thank you!