Alert: Logical Disk Fragmentation Level is High
Issue: I started receiving these alerts after the functionality was added to the OpsMgr management pack. My first thought was: cool! Now I can determine the level of fragmentation on the various drives in my environment. The alerts first appeared as a group after the management pack ran its weekly 3:00 am check. By default this monitor raises a warning alert if the volume is more than 10% fragmented, which changes the health state from green to yellow. I understand the logic behind scheduling the check for off-hours, but hopefully nobody’s going to wake up at 3:05 in the morning to defragment a drive. Digging through the monitor, there is even a recovery which allows for the automatic defragmentation of the drive (that’s pretty cool). Using this recovery we could create a group of servers on which the defragmentation recovery runs automatically when a drive goes into a warning state. This would allow auto-defragmentation of the majority of the drives in an environment while still letting us exclude other servers from auto-defragmentation.
Resolution: I have held off writing this up until now because addressing this alert gets a little complicated, and even after spending time thinking about it there isn’t a simple process to follow. In one of our environments with 250 servers we had 450 warnings on disk fragmentation levels (which represented more than half of the total alerts in the environment). When 60% of the alerts come from the same monitor, it draws attention pretty quickly. We have used the task available in the MP to defragment the drives (which works well even under local system) and then manually reset the health of this monitor to bring it back to a healthy state. Unfortunately, with 450 disk fragmentation level warnings it’s not really viable to defragment them all with the built-in task.
The best approaches we have identified for this alert depend upon the number of systems involved. For the majority of servers, create a group and configure its members to run the defragmentation recovery automatically. For servers which are not part of this group, use the built-in task to defragment the drives. Additionally, for most environments 10% is too low a threshold for fragmentation levels. We have configured it to 75% for our environment so that we can identify the most highly fragmented drives, defragment them, and then hopefully lower the threshold from 75% over time.
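To make the split concrete, here is a minimal Python sketch of the triage described above. It is only an illustration and does not call any OpsMgr API: the `Volume` record, the `plan_defrag` function, and the server names are all hypothetical, and the only values taken from this post are the 75% threshold and the 10% default.

```python
from dataclasses import dataclass

# 75% is the override used in this post; the management pack default is 10%.
WARNING_THRESHOLD_PCT = 75.0


@dataclass(frozen=True)
class Volume:
    """Hypothetical record for one logical disk and its measured fragmentation."""
    server: str
    drive: str
    fragmentation_pct: float


def plan_defrag(volumes, auto_group, threshold=WARNING_THRESHOLD_PCT):
    """Split volumes over the threshold into two lists.

    auto   -> server is in the auto-recovery group, so the recovery handles it
    manual -> server is excluded, so someone runs the built-in task by hand
    """
    auto, manual = [], []
    for vol in volumes:
        # The monitor warns only when fragmentation is MORE than the threshold.
        if vol.fragmentation_pct <= threshold:
            continue
        if vol.server in auto_group:
            auto.append(vol)
        else:
            manual.append(vol)
    return auto, manual


if __name__ == "__main__":
    volumes = [
        Volume("SRV01", "C:", 80.0),
        Volume("SQL01", "D:", 90.0),  # deliberately left out of the auto group
        Volume("SRV02", "C:", 40.0),  # under the threshold, no action
    ]
    auto, manual = plan_defrag(volumes, auto_group={"SRV01", "SRV02"})
    print("auto recovery:", [(v.server, v.drive) for v in auto])
    print("manual task:", [(v.server, v.drive) for v in manual])
```

Lowering `threshold` on later passes mirrors the plan of starting at 75% and working down once the worst drives have been handled.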
Management Pack Evolution: I would like to be able to enable this recovery for all servers in the environment (and exclude specific ones); however, there are a few major challenges with this approach as of this version of the management pack:
· IT organizations are extremely hesitant to automatically defragment systems because it can slow down the performance of the server. If we could back this up with a counter which monitors the actual defragmentation process, this risk could be mitigated. I originally investigated using a process monitor to do this, but a process monitor won’t let you monitor for between 0 and 1 instances of an item (only a minimum of 1 and a maximum of 1, not a minimum of 0 and a maximum of 1). If we could add functionality which monitors the actual defragmentation and alerts in case of an issue, this would remove most of the hesitancy about automatically defragmenting these systems.
· On the weekly fragmentation check, the state needs to be reset to green and then recalculated to determine what the actual level of fragmentation is. Otherwise this situation can occur: my drive is highly fragmented and goes to warning, I defragment it on Tuesday, but by Friday it’s back over 10%, so the state stays in warning. As a result, since the state has not changed, it will not fire a recovery for this drive. Additionally, if I run the task or the recovery, the state of the entity should be reset to green for the same reason. I know that I defragmented it, and I hope that it’s actually defragmented, but if it’s not, I want OpsMgr to run the defragmentation again when the state changes from green back to yellow.
· There should be another task available which would re-run the assessment of the fragmentation level so that we could get updated metrics for what the actual fragmentation is at that point in time (for example, if I just defragmented a drive, I would like to be able to see remotely that it’s now green and that the fragmentation level is now 5% or whatever it is).
· The level of fragmentation should be stored in a performance counter so this can be trended.
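The state-reset problem in the second bullet can be illustrated with a small simulation. This is a simplified model under stated assumptions, not how OpsMgr is implemented: the `recoveries_fired` function and the weekly readings are made up, and the model assumes a recovery fires only on a healthy-to-warning state change, as described above.

```python
def recoveries_fired(weekly_readings, threshold=10.0, reset_after_defrag=False):
    """Return the indexes of the weekly checks at which a recovery would fire.

    Assumption: the recovery fires only when the state CHANGES from healthy
    to warning. With reset_after_defrag=True the state is put back to healthy
    once the defragmentation has run, so the next check is evaluated fresh.
    """
    state = "healthy"
    fired = []
    for week, pct in enumerate(weekly_readings):
        new_state = "warning" if pct > threshold else "healthy"
        if state == "healthy" and new_state == "warning":
            fired.append(week)  # state change, so the recovery runs
            state = "healthy" if reset_after_defrag else "warning"
        else:
            state = new_state  # no healthy-to-warning change, nothing fires
    return fired


if __name__ == "__main__":
    # Hypothetical readings: still over 10% at the next two weekly checks.
    readings = [35.0, 12.0, 14.0, 8.0]
    print(recoveries_fired(readings))
    print(recoveries_fired(readings, reset_after_defrag=True))
```

With these readings the first call returns `[0]`: the drive stays in warning for weeks 1 and 2, so no further recovery fires. With the reset enabled the result is `[0, 1, 2]`, which is the behavior requested in the bullet.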
I believe that this type of functionality would be a huge benefit of having Operations Manager deployed. Through this we could (to use my own terminology from the Unleashed book) automatically adapt to changing conditions like heavy fragmentation and resolve the issue before manual intervention is required. We could also effectively report that last month Operations Manager identified X drives which were heavily fragmented and performed Y automated defragmentations.