Wednesday, June 3, 2015

Migration of backup software from NetApp tools to Veeam

We weren't exactly happy with the reporting of VSC, SMVI and SnapVault.  We were also a bit afraid of the black beast of vendor lock-in.  So we chose what most people consider the market leader for agentless VM backups: Veeam.
If you're in a similar situation, I can only recommend going with a partner that knows both NetApp and Veeam.

We hadn't removed certain scheduled SMVI tasks, and this caused trouble in combination with Veeam.  A NetApp-savvy consultant probably would have noticed this during installation.

Symptoms of the problem: Friday at 22:06, vCenter suddenly stops the virtual machine that serves as both backup server and backup proxy.

VMware tries to restart the VM several times but fails because a virtual disk appears to be corrupt.  To make things worse, this 'corrupt' virtual disk belongs to another VM...  The VM also refuses to start unless we remove this virtual disk from its configuration (remove it from the configuration only; don't delete the virtual disk file itself!)
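To find which disk entry in the configuration doesn't belong to the VM, you can compare the folder in each disk's datastore path with the VM's own folder.  Below is a minimal sketch in Python, assuming the usual "[datastore] folder/file.vmdk" path layout and assuming the VM's folder is named after the VM (which is not always the case).  The function name and the sample paths are made up for illustration.

```python
import re

# Matches VMware-style datastore paths such as "[datastore1] vmname/vmname.vmdk".
_DS_PATH = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s+(?P<folder>[^/]+)/(?P<file>.+)$")


def foreign_disks(vm_folder, disk_paths):
    """Return the disk paths whose folder differs from the VM's own folder.

    Hypothetical helper: vm_folder is the folder the VM's own files live in,
    disk_paths are the backing paths listed in the VM's configuration.
    """
    foreign = []
    for path in disk_paths:
        match = _DS_PATH.match(path)
        if match is None:
            continue  # not a recognisable datastore path; skip it
        if match.group("folder") != vm_folder:
            foreign.append(path)
    return foreign


if __name__ == "__main__":
    # Sample data, invented for this example.
    disks = [
        "[ds1] veeamproxy/veeamproxy.vmdk",
        "[ds1] fileserver/fileserver-000001.vmdk",
    ]
    for path in foreign_disks("veeamproxy", disks):
        print("Does not belong to this VM:", path)
```

Anything it reports is a candidate for removal from the configuration, not for deletion from the datastore.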

To understand what happened, you have to know how Veeam backs up a VM in virtual appliance mode.  This is taken from the manual:

  • The backup proxy sends a request to the ESX(i) host to locate the necessary VM on the datastore.
  • The ESX(i) host locates the VM.
  • Veeam Backup & Replication triggers VMware vSphere to create a VM snapshot.
  • VMware vSphere creates a linked clone VM from the VM snapshot. 
  • Disks of a linked clone VM are hot-added to the backup proxy or helper VM.
  • Veeam Backup & Replication reads data directly from disks attached to the backup proxy or helper VM through the ESX(i) I/O stack. When the backup process is complete, disks are detached from the backup proxy or helper VM.

So the mystery of the virtual disk belonging to another VM is solved.  It is the remnant of a (failed) backup: a disk that was hot-added to the proxy and never detached.  But why around 22:00?  Of course this is no coincidence either.  At 22:00 the backup kicks in, but something triggers an event which scares VMware so much that it just shuts the server down...

So I'm looking for something that happens to that machine around 22:00.  Maybe another backup that also uses VSS?  Now the NetApp-based snapshots are suspect.  Then I notice an SMVI script is still scheduled around this time.  We forgot to disable this schedule.
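A check like this can also be done up front, by listing all scheduled jobs that touch the same VMs and looking for overlapping time windows.  Here is a small Python sketch of that idea.  The job names, start times and durations are invented for the example (they are not our real schedules), and it assumes no window runs past midnight.

```python
from itertools import combinations


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def overlapping_jobs(jobs):
    """Return pairs of job names whose time windows overlap.

    jobs is a list of (name, start 'HH:MM', duration in minutes).
    Assumption: windows are compared within a single day, so a job
    running past midnight is not matched against early-morning jobs.
    """
    windows = [
        (name, to_minutes(start), to_minutes(start) + duration)
        for name, start, duration in jobs
    ]
    clashes = []
    for (name1, start1, end1), (name2, start2, end2) in combinations(windows, 2):
        # Two windows overlap when each one starts before the other ends.
        if start1 < end2 and start2 < end1:
            clashes.append((name1, name2))
    return clashes


if __name__ == "__main__":
    # Hypothetical schedule, for illustration only.
    schedule = [
        ("veeam-nightly", "22:00", 120),
        ("smvi-snapshot", "22:05", 15),
        ("report-mail", "06:00", 10),
    ]
    for first, second in overlapping_jobs(schedule):
        print("Overlap:", first, "and", second)
```

It won't tell you whether two overlapping jobs really conflict, but it gives you a short list of suspects to look at.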

After deleting these scheduled SMVI scripts, the problem was gone.