OCI FortiGate HA Cluster – Reference Architecture: Code Review and Fixes – Eclipsys

Written by Kosseila Hd | Jan 18, 2024 12:26:00 AM

Introduction

OCI Quick Start repositories on GitHub are collections of Terraform scripts and configurations provided by Oracle. These repositories are designed to help organizations quickly deploy common infrastructure setups on the OCI platform. Each Quick Start focuses on a specific use case or workload, which simplifies provisioning on OCI with Terraform. Think of them as IaC-based reference architectures.

Today, we will code review one of those reference architectures: a Fortinet firewall solution deployed in OCI.

Note: This article won’t discuss the architecture, but will rather address the flaws in its Terraform code and their fixes.

Why do some errors never get to your OCI Resource Manager Stack?

Certain Terraform errors may not reach your RM stack due to its design. For instance, RM allows the hardcoding of specific variables, like availability domains, directly in its interface. This sidesteps the need for these variables to be checked by native conditions in the TF code.
Moreover, RM reads these variables from the schema.yaml file, altering the behavior compared to local Terraform CLI execution. This approach can result in certain errors being handled or bypassed within the RM environment, creating a distinction from standard Terraform workflows.

The Stack: FortiGate HA Cluster using DRG – Reference Architecture

The stack is a result of the collaboration of both Oracle and Fortinet. This architecture is based on a Hub and Spoke topology, using a FortiGate firewall from OCI Marketplace. I actually deployed it while working on one of my projects.

For details of the architecture, see Set up a hub-and-spoke network topology.

The Repository

You will find this Terraform config under the main oci-fortinet GitHub repository, but not in the root directory.

The folder in question is drg-ha-use-case, under oracle-quickstart/oci-fortinet/use-cases/drg-ha-use-case

The Errors

At the time of writing, the errors were still not fixed, despite my opening issues and sharing the fixes. You can see that the last commit dates back 2 years. You will need to clone the repo and cd into the drg-ha-use-case subdirectory

$ git clone https://github.com/oracle-quickstart/oci-fortinet.git
$ cd oci-fortinet/use-cases/drg-ha-use-case
$ terraform init

1. Data source error in regions with a single AD

You will face this issue in a region with only one availability domain (e.g. ca-toronto-1), as the availability domain data source lookup will fail the Terraform execution plan.

CAUSE: See issue #8

In the error, Terraform complains that the availability domains data source has only one element.
This impacts 2 of the “oci_core_instance” resource blocks (2 web-vms, 2 db-vms).
- compute.tf => line 235 & line 276

Problem?
- On single-AD regions, the only valid index into the availability domains data source is 0 (1 element).
  See data_source.tf line 8-10. This configuration hasn’t been tested in single-AD regions.

 $ vi data_source.tf 
   # ------ Get list of availability domains
   8 data "oci_identity_availability_domains" "ADs" {
   9  compartment_id = var.tenancy_ocid
   10 }
  …

Reason:
- In Terraform, count.index always starts at 0. If you have a resource with a count of 4, count.index will take the values 0, 1, 2, and 3.
- Let’s take for example the “web-vms” oci_core_instance block in compute.tf > line 235

- If we evaluate the condition on the first iteration (count.index = 0):
  – The variable availability_domain_name is empty
  – The ads data source length = 1 element. That means the AD name will be read from the ads data source collection at index [0+1] = 1
- data…ads.availability_domains[1] doesn’t exist, as the collection only contains 1 element (index 0)

The Solution

Complete the full availability domain conditional expression on line 235 and line 276 (web-vms/db-vms)

Add the case where data source ads.availability_domains has 1 element (the region has one AD only)
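
As a rough illustration of that extra case, the expression could look like the sketch below. This is not the repository's exact code: the data source label (ADs, taken from the data_source.tf excerpt above), the variable name var.availability_domain_name, and the count of 2 are assumptions based on the description in this article, so adjust them to match your copy of compute.tf.

```hcl
resource "oci_core_instance" "web-vms" {
  count = 2 # assumed from "2 web-vms"

  availability_domain = (
    # 1) Use the AD name supplied by the user, if any
    var.availability_domain_name != "" ? var.availability_domain_name : (
      # 2) Added case: single-AD region, so always read element [0]
      length(data.oci_identity_availability_domains.ADs.availability_domains) == 1 ?
      data.oci_identity_availability_domains.ADs.availability_domains[0].name :
      # 3) Existing lookup, left as-is here (see "Bad Logic" for its limits)
      data.oci_identity_availability_domains.ADs.availability_domains[count.index + 1].name
    )
  )

  # ... remaining arguments unchanged
}
```

The same change would apply to the db-vms block on line 276.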

Bad Logic

Looking up the availability domain at count.index+1 is still wrong when the region has more than 1 AD.

Example: say you want to create 3 VMs and your region has 2 availability domains.
- The first iteration [0] sets count.index+1 = 1 (2nd data source element = AD2)
- The second iteration sets count.index+1 = 2 (3rd data source element = AD3, which doesn’t exist)
- The third iteration sets count.index+1 = 3 (also out of range)
- The 2nd and 3rd iterations will always fail because there are only 2 ADs (index list [0,1]).
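
One common way to avoid the out-of-range index is to wrap it with a modulo so the lookup cycles through whatever ADs exist. The sketch below is a stand-alone illustration, not the repository's code: ad_names and vm_count are hypothetical variables standing in for the data source list and the resource count.

```hcl
# Hypothetical stand-in for the AD names returned by the data source.
variable "ad_names" {
  type    = list(string)
  default = ["AD-1", "AD-2"]
}

# Hypothetical number of VMs to place.
variable "vm_count" {
  type    = number
  default = 3
}

locals {
  # i % length(...) wraps back to index 0 once the list is exhausted,
  # so the index can never exceed the last valid element.
  vm_ads = [for i in range(var.vm_count) : var.ad_names[i % length(var.ad_names)]]
}

output "vm_ads" {
  value = local.vm_ads # ["AD-1", "AD-2", "AD-1"]
}
```

Inside a resource block with count, the equivalent index would be count.index % length(...). Terraform's built-in element() function applies the same wrap-around, so it is another option. Either way, a region with 1, 2, or 3 ADs is handled by the same expression.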

2. Wrong compartment argument in the security list data sources

Another issue you will run into is a failure to deploy subnets because a data source collection is empty (no elements).

CAUSE: See issue #9

In the error, Terraform complains that the “allow_all_security” data source is empty.
- This impacts all FortiGate subnet blocks in the config, as they all share the same security lists.
  - network.tf => line 240 & more

Reason:

In this configuration, there are 2 compartments: one for compute and another for network resources.
If you take a look at the “allow_all_security” block in datasource.tf > line 64 to 74, you’ll notice a wrong compartment ID in the security lists data source (compute instead of network).

Solution

This was a silly mistake, but it took me a day to figure out while digging through a pile of unfamiliar Terraform files.

All you need to do is replace the compute compartment variable with var.network_compartment_ocid

 Edit datasource.tf line 64-74
# ------ Get the Allow All Security Lists for Subnets in Firewall VCN
data "oci_core_security_lists" "allow_all_security" {
  compartment_id = var.network_compartment_ocid    # <--- CORRECT compartment
  vcn_id         = local.use_existing_network ? var.vcn_id : oci_core_vcn.hub.0.id
...

3. More Code Inconsistencies

I wasn’t done debugging, as I found other misplaced compartment variables in some VNIC attachment data sources.

See datasource.tf, lines 103-115 and 118-130: you need to replace the compartment variables there with var.compute_compartment_ocid
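
For orientation, a corrected VNIC attachment data source would have roughly the shape below. This is only a sketch: the data source label and the instance reference are hypothetical placeholders, not the names used in the repository. VNIC attachments are looked up per compute instance, which is why the compartment should be the compute one.

```hcl
# Hypothetical labels; keep the names already used in datasource.tf.
data "oci_core_vnic_attachments" "example_vnic_attachments" {
  compartment_id = var.compute_compartment_ocid  # compartment that holds the instance
  instance_id    = oci_core_instance.example.id  # hypothetical instance reference
}
```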

Conclusion and Recommendations

This type of undetected code issue is why I never trust the first deployment in Resource Manager.
To avoid problems in the future, especially if you decide to migrate out of RM at some point, I suggest the following workflow:

1. Run locally and fix any code bugs
2. Run on Resource Manager
3. Store to git repo (blueprint with eventual versioning)

I hope this was helpful, as the issues I opened in their GitHub repo have remained unresolved for over a year.
