Change Request drives maintenance mode

The Case:

Sunday night, 1:30 AM. The NOC team is ringing your phone, desperately trying to wake you up. One of the critical business apps is down, and they have been trying to reach the team that owns it for close to an hour, but every call goes to voicemail.

You fall out of bed, rush through a shower, speed a little on the freeway toward the data center, and check in with security. It’s Sunday, but based on what the NOC sees from Ops Manager, everything is down, and you don’t know if you can get it back online, unsupported, before business starts in 5 hours or so.

You turn the corner toward your company’s rack, only to see the entire critical app team sitting around a card table, playing poker, eating pizza, and watching the software maintenance package on the critical server run through its offline upgrade. Their mobile phones are all stacked in the center of the table, because there is no signal in the data center, and they’re halfway through “CR99281: Routine maintenance update to critical business app”, scheduled for Sunday, 12:30 AM to 4:30 AM.

It’s a common problem. Fortunately, if your company uses Ops Manager for health and alerting and Service Manager for change tracking, there are a lot of ways to make them interact. Change requests driving maintenance mode isn’t available out of the box, but we can accomplish that using some PowerShell workflows.

The Solution:

In short, we’ll be creating a scheduled workflow in Service Manager that examines change requests 5 minutes ahead of time and sets affected CIs into maintenance mode, silencing alerts for the duration of the change. This eliminates both the need to manually schedule maintenance (as in the example above) and the possibility that the service won’t come back to alerting after the change.
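To make the look-ahead idea concrete, here is a minimal sketch of the window test in plain PowerShell. The function name Test-DowntimeWindow and its parameters are made up for this illustration and are not part of SMLets or Service Manager; the real workflow expresses the same comparison as a criteria string against the change request class.

```powershell
# Hypothetical helper for illustration only.
# Returns $true when the moment "now + look-ahead" falls inside the downtime window.
function Test-DowntimeWindow {
    param(
        [datetime]$WindowStart,
        [datetime]$WindowEnd,
        [datetime]$Now,
        [int]$LookAheadMinutes = 5
    )
    $SearchTime = $Now.AddMinutes($LookAheadMinutes)
    return ($WindowStart -le $SearchTime) -and ($WindowEnd -ge $SearchTime)
}

# The window from the story: Sunday, 12:30 AM to 4:30 AM
$Start = [datetime]'2013-10-20T00:30:00'
$End   = [datetime]'2013-10-20T04:30:00'

# 12:20 AM: the look-ahead (12:25) has not reached the window yet
Test-DowntimeWindow -WindowStart $Start -WindowEnd $End -Now ([datetime]'2013-10-20T00:20:00')   # False

# 12:26 AM: the look-ahead (12:31) is inside the window
Test-DowntimeWindow -WindowStart $Start -WindowEnd $End -Now ([datetime]'2013-10-20T00:26:00')   # True
```

A run at 12:26 AM is the first one that picks up the change request, so maintenance mode starts a few minutes before the scheduled downtime.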

Prerequisites:

You’ll need the open source SMLets. This PowerShell module implements a lot more functionality than the out-of-the-box SM snap-ins, and I find it a lot easier to use. This code was written against Beta 4 for System Center 2012 with SP1; however, it should be compatible with later editions.

You’ll also need to install the Operations Manager console on your Service Manager workflow server. I recommend installing the SP1 version of the console, even if you’re not on Ops Mgr SP1, since the RTM version has some conflicts with Service Manager and the console is backwards compatible.

Implementation:

Create a new MP in the Authoring Tool, and extend the change request class with at least one new bool property, “MaintenanceModeCompleted”. This is used by the workflow to ignore change requests it has already examined.

Next, you’ll need to extend the change request form to expose the ScheduledDowntime date properties and the HasDowntime flag. Your team will use these properties to tell the workflow whether and when to schedule maintenance in Operations Manager. I usually take this opportunity to expose all of the scheduled and actual date fields while I’m already in there customizing stuff.

Finally, you’ll need to create a new workflow, configure it to run every 5 minutes, add a PowerShell activity, and import the PowerShell script below.

# Operations Manager Maintenance mode from change request
# Service Manager scheduled periodic workflow
# Version History:
# v1.0 - Oct 17, 2013: Clean room rewrite by Thomas.Bianco@sparkhound.com

powershell
{
# The load order of the OM and SM cmdlets is important.
# SMLets correctly extends Ops Manager's cmdlets;
# however, Ops Manager will complain and fail if it sees SMLets already loaded.
# Using an isolated PowerShell session ensures the right order.
import-module operationsmanager
import-module smlets

#Work Item class definitions
$CRClass = Get-scsmclass System.WorkItem.ChangeRequest$

#Relationship class definitions
$WIAboutCIRelClass = Get-SCSMRelationshipClass System.WorkItemAboutConfigItem$

#build a list of all change requests that are going to be in a downtime window 5 minutes from now
$Now = (get-date).ToUniversalTime()
$SearchTime = $now.addminutes(5)
$CriteriaString = "ScheduledDowntimeStartDate <= '$SearchTime' and ScheduledDowntimeEndDate >= '$SearchTime' and HasDowntime == '$true'"

# for large environments, a useful optimization is to extend the CR class
# with a flag for whether the change is approved
# use criteria similar to " and IsApproved == '$true'"
$Criteria = new-object "Microsoft.EnterpriseManagement.Common.EnterpriseManagementObjectCriteria" $CriteriaString,$CRClass

# get and filter the list for changes that are not already in maintenance mode
$CRList = get-scsmobject -Criteria $Criteria | ? {$_.MaintenanceModeCompleted -ne $true}

#this should be null if no CRs found.
if ($CRList)
{
  # Build Explicit OM Connections
  # based on the connectors from the service manager linking framework MP
  # we have to read the raw XML and get server names from the write action rules
  $MPXML = [XML] (Get-SCSMManagementPack linkingframework.config).GetXML()
  $RuleArray = $MPXML.ManagementPack.monitoring.Rules.rule | ? {$_.WriteActions.WriteAction.Typeid -match "OpsMgrConnector.Sync"}
  $OMServerNames = ($RuleArray | ForEach {$_.writeactions.WriteAction.ConfigData.RemoteEndpoint }) | select -unique
  $OMServerNames | Foreach { New-SCOMManagementGroupConnection -computername $_ }
  $ScomConnections = Get-SCOMManagementGroupConnection

  ForEach ($CR in $CRList)
  {
    #take notes on errors for later
    $MaintenanceModeErrors = [string]""

    #get relationship objects for config items
    $CIRelArray = (Get-SCSMRelationshipObject -BySource $cr | ? {$_.Relationshipid -eq $WIAboutCIRelClass.id -and $_.IsDeleted -eq $False})

    Foreach ($CIRel in $CIRelArray)
    {

      #walk the Affects CI relations and get the full object
      $CI = (get-scsmobject -id $CIRel.TargetObject.id)

      #then check if the full object has a matching ops mgr class
      if (get-scomclass -name $CI.ClassName -scsession ($ScomConnections))
      {
        # if the Class exists in one of the connected ops manager systems
        # try to process it into maintenance mode.
        $CIfound = $False

        #Walk each connection and find the CI
        Foreach ($ScomConnection in $ScomConnections)
        {

          $OMCI = Get-SCOMClassInstance -id $CI.id -SCSession $ScomConnection
          if ($OMCI)
          {
            #found it in this OM Group. set maintenance mode
            $CIfound = $True
            if (-not $OMCI.InMaintenanceMode)
            {
              try
              {
                # [PartialMonitoringObject].ScheduleMaintenanceMode (start, end, Reason, Comments, TraversalDepth)
                $omci.ScheduleMaintenanceMode( (get-date).touniversaltime(),
                                               [System.Datetime]::specifyKind($cr.ScheduledDowntimeEndDate,[System.DatetimeKind]::UTC),
                                               "PlannedOther",
                                               ("Scheduled maintenance regarding "+$CR.Name),
                                               "Recursive" )
              }
              Catch
              {
                $MaintenanceModeErrors = $MaintenanceModeErrors + "`nFailed to set Maintenance mode for "+$OMCI.Name+".`nThe error was "+$_.ToString()+"."
              }
            } # if (-not $OMCI.InMaintenanceMode)
          } # if ($OMCI)
        } # Foreach ($ScomConnection in $ScomConnections)

        If (-not $CIfound)
        {
          #no OMCI object
          $MaintenanceModeErrors = $MaintenanceModeErrors + "`nCould not find operations manager class instance for "+$CI.Name+" of type "+$CI.ClassName+" in any connected Operations Manager group. Maintenance mode can't be set."
        }

      } #if (get-scomclass -name $CI.ClassName -scsession ($ScomConnections))
      Else
      {
        #Class does not exist in any connected OM group
        $MaintenanceModeErrors = $MaintenanceModeErrors +"`n"+ $Ci.Name +" is of class "+$CI.ClassName+" which does not have a matching class in any connected Operations Manager group. Maintenance mode can't be set."
      } #Else (get-scomclass -name $CI.ClassName -scsession ($ScomConnections))
    }

    #update the change request accordingly
    $CRHash = @{
      MaintenanceModeCompleted = $true
      Notes = $cr.Notes + $MaintenanceModeErrors
    }
    Set-scsmobject $CR -propertyhash $CRHash
  } #ForEach ($CR in $CRList)
} #if ($CRList)
} #Powershell

You’ll need to deploy the MP and the workflow .DLL to your management group server.

In Other News:

In large environments, it may be beneficial to only examine change requests that have been approved and are ready to execute. You might want to extend the Change Request class with a new bool property representing whether the change is approved, and then add “and IsApproved == '$true'” to the criteria string. The implementation of a workflow to maintain that flag is left as an exercise for the reader.

You might think that 5 minutes is too frequent. Unfortunately, the Operations Manager objects do not have an easy way to schedule maintenance in advance, so the trade-off for running this workflow less frequently, say every hour, is that the workflow would need to look an hour ahead. This would mean that the servers might go into Ops Manager maintenance mode as early as an hour before the maintenance window on the change request (11:30 PM, in our example), rather than only 5 minutes ahead. If you are OK with this, go ahead and adjust the schedule and the look-ahead time in the workflow.
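The arithmetic behind that trade-off is simple enough to sketch. The helper name Get-EarliestMaintenanceStart is hypothetical and exists only for this illustration; it assumes the look-ahead time is set equal to the workflow schedule interval, as described above.

```powershell
# Hypothetical helper for illustration only.
# With a look-ahead of N minutes, the earliest moment a CI can enter
# maintenance mode is N minutes before the scheduled downtime start.
function Get-EarliestMaintenanceStart {
    param(
        [datetime]$WindowStart,
        [int]$LookAheadMinutes
    )
    $WindowStart.AddMinutes(-$LookAheadMinutes)
}

$Start = [datetime]'2013-10-20T00:30:00'   # Sunday, 12:30 AM
$Inv   = [cultureinfo]::InvariantCulture

(Get-EarliestMaintenanceStart -WindowStart $Start -LookAheadMinutes 5).ToString('yyyy-MM-dd HH:mm', $Inv)    # 2013-10-20 00:25
(Get-EarliestMaintenanceStart -WindowStart $Start -LookAheadMinutes 60).ToString('yyyy-MM-dd HH:mm', $Inv)   # 2013-10-19 23:30
```

The actual start lands somewhere between that earliest time and the window start, depending on when the workflow happens to run.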

One trick I used is finding the Ops Manager servers by reading the raw MP XML for all of the connectors. This is kind of an ugly hack, but it seems to work well. There are better ways to read data from connectors, but they mostly require assembly reflection and knowledge of the SDK, so I made a judgement call.

You might be asking “Why not Orchestrator?” It seems like a perfect use case. You’re probably right; more on that elsewhere.
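If you want to see what that XML walk does without a Service Manager server handy, here is a small sketch against a made-up document. The sample XML only mimics the element path the script reads (Rules, WriteActions, ConfigData, RemoteEndpoint); the rule IDs, type IDs, and server names are invented, and a real connector MP contains much more than this.

```powershell
# Hypothetical sample shaped like the part of the connector MP the script reads.
# All IDs and server names below are invented for illustration.
$SampleXml = [xml]@'
<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="Example.Connector.A">
        <WriteActions>
          <WriteAction ID="WA" TypeID="Example.OpsMgrConnector.Sync.WriteAction">
            <ConfigData><RemoteEndpoint>om01.example.com</RemoteEndpoint></ConfigData>
          </WriteAction>
        </WriteActions>
      </Rule>
      <Rule ID="Example.Connector.B">
        <WriteActions>
          <WriteAction ID="WA" TypeID="Example.OpsMgrConnector.Sync.WriteAction">
            <ConfigData><RemoteEndpoint>om02.example.com</RemoteEndpoint></ConfigData>
          </WriteAction>
        </WriteActions>
      </Rule>
      <Rule ID="Example.Connector.A.Second">
        <WriteActions>
          <WriteAction ID="WA" TypeID="Example.OpsMgrConnector.Sync.WriteAction">
            <ConfigData><RemoteEndpoint>om01.example.com</RemoteEndpoint></ConfigData>
          </WriteAction>
        </WriteActions>
      </Rule>
      <Rule ID="Example.Unrelated">
        <WriteActions>
          <WriteAction ID="WA" TypeID="Example.Other.WriteAction">
            <ConfigData><RemoteEndpoint>ignored.example.com</RemoteEndpoint></ConfigData>
          </WriteAction>
        </WriteActions>
      </Rule>
    </Rules>
  </Monitoring>
</ManagementPack>
'@

# Keep only the rules whose write action looks like an Ops Manager connector sync
$RuleArray = $SampleXml.ManagementPack.Monitoring.Rules.Rule |
    Where-Object { $_.WriteActions.WriteAction.TypeID -match 'OpsMgrConnector\.Sync' }

# Pull the server name out of each rule and drop duplicates
$OMServerNames = @($RuleArray |
    ForEach-Object { $_.WriteActions.WriteAction.ConfigData.RemoteEndpoint } |
    Select-Object -Unique)

$OMServerNames   # om01.example.com, om02.example.com
```

The unrelated rule is filtered out by the TypeID match, and the duplicate endpoint collapses to one entry, so each management group gets only one connection.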
