Friday, 17 July 2020

Problems passing through GPU to an unRaid VM?

One of the most common problems I see people seeking help for on unRaid forums, reddit and discord is the thorny issue of GPU pass-through. People with the same problems tend to ask the same questions, and, somewhat tired of providing the same troubleshooting steps over and over, I thought I'd write up the main pitfalls and possible solutions.

Here are some steps to check before reaching out for help.

1. Have you done your research?

As noted, there's a ton of threads on unRaid forums covering this, many successfully resolved. Do conduct a search there. For example, searching 'problem with gpu passthrough' today produces 40 pages of results. Chances are, your issue has been raised and resolved several times before.

Another oft-posted remark is something along the lines of 'I followed spaceinvaderone's video, and it still doesn't work'.

In most cases where I've seen this, the user has incorrectly followed the video - missing a key step or subtle point. This is NOT a straightforward process. There are many variables involved and SpaceInvaderOne's videos are packed with information. He does an amazing job of explaining the complex process step by step, so make sure to rewatch these videos several times to ensure you haven't missed anything.

If you haven't been following one of his videos for GPU pass-through, why not? Get very familiar with the process and follow along before posting queries.

Here are some links to his channels;

Research done? Let's look at some of the common missteps....

2. Check IOMMU Groups

For pass-through of any device to work, you need to ensure your system is separating the pass-through devices into discrete IOMMU groups. In unRaid, go to Tools -> System Devices and have a look at the 'PCI Devices and IOMMU Groups' section. Here's a section of mine;
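It looks something like this (illustrative entries based on the devices discussed below; your device names and group numbers will differ):

      IOMMU group 41:  [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (VGA controller)
                       [AMD/ATI] Ellesmere HDMI Audio
      IOMMU group 45:  [TI] PCI Express-to-PCI Bridge
                       [M-Audio] Multichannel PCI audio card
                       [M-Audio] Multichannel PCI audio card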

Observe here how various devices are 'grouped'. If you are passing through hardware to a VM, all devices in a group must be passed through, or it won't work.

Have a look at group 45. In my case, I have 2x legacy M-Audio multichannel PCI sound cards that I use for whole-house audio. These are connected to a PCI riser card (the TI devices in this group). I have all these devices passed through to a Windows 10 VM that looks after the management of this whole house audio system. It would not be possible for me to pass through just one of these cards by itself, or have the two cards passed to different VMs.

Look at Group 41. This is one of the GPUs in the system. See that the two parts of the device, video and audio, are listed here, but there's nothing else in the group. This is ideal and means that this GPU can be passed through to a VM.

If your listing looks like this for the device you are trying to pass through, skip ahead.

If, however, you were to see other devices in this group, you'd need to either pass through all the devices, or work on getting the groups to redefine themselves.

Sometimes it's OK to pass through all devices, as in my example above, but if the GPU was bundled with, say, your on-board ethernet, or a USB controller that you're using for your boot thumbdrive, you'll need to get them separated.

How you go about this, and whether you can achieve it at all, will be dependent on your hardware, most importantly your motherboard.

Many motherboards have a setting in the bios to turn on or configure IOMMU groups. You need to ensure this is enabled. To figure out how to do it for your motherboard, consult the user manual, manufacturer support site, forums or other reliable internet resources, including unRaid forums.

If, having enabled IOMMU in bios, you still don't see good separation / grouping, all is not lost. In unRaid, go to Settings -> VM Manager and switch on 'Advanced View' (top right of the screen). This will expose the PCIe ACS override settings (turn on help to get an overview of what this does).

Again, whether and how this works will be dependent on your own setup and hardware, but you can cycle through the options here, reboot and see if there are any changes. You can also toggle unsafe interrupts on and off in conjunction with each override setting to see if that helps in any way.

3. Use SeaBIOS

When configuring a VM for pass-through, be sure to set the BIOS to 'SeaBIOS'. Getting this setting wrong is, by far, the biggest single cause of failure I have encountered.

You cannot change this after VM creation, so to switch, you need to create a new VM altogether.

The machine type is less relevant, but I've had good luck solving problems by switching from Q35 to i440fx, or vice versa. While you can switch this setting on a created VM, I've found that unRaid will often throw errors about PCI bus ids and the like. These may be overcome with manual editing of the XML file, but it's often easier to just recreate the VM.

4. Configure the device correctly

When you set up or edit a VM in unRaid, there's an easy-to-use GUI editor that allows you to configure the VM through checkboxes and dropdowns. This generates the XML file that's needed by the VM engine to spool up the VM. (You can access and edit the underlying XML code by clicking the 'Form View' toggle on the top right of the VM Edit page.)

This is great, but there's a long-standing issue (bug?) in the unRaid VM manager. When a GPU is assigned to a VM, unRaid incorrectly splits the video and audio portions of the GPU in the XML so as to make it appear to the VM that they are two discrete devices in separate (virtual) slots.

In a real machine, you'd never have the video and audio portions of a single GPU in different PCIe slots, so why should this be the case in a VM? This can cause all kinds of issues so needs to be fixed. You need to access the XML file to resolve this.

Unfortunately, there's another long-standing issue/bug in unRaid that comes into play here. If you make any custom changes to a VM XML file, the next time you make an edit to the same VM via the GUI editor, those changes will be wiped. This is a royal PITA. The best advice here is: once you make an XML change, copy and save that XML file somewhere so you can easily revisit it in the future and paste changed segments back in.

That said, here's what to look for and fix;

You will typically add a GPU using the GUI editor like this, selecting the Video and Audio portions appropriately;

In my example, note the numbers in brackets after the device name. This is the physical address of that device in my unRaid server. In this case, the video portion is on Bus 9, Slot 0, Function 0, while the audio portion is on Bus 9, Slot 0, Function 1.

So, both parts of this device are in the same slot. Now, let's switch to XML view and see what unRaid has done;

If you scroll towards the bottom of the XML, you'll find 2x 'hostdev' blocks that contain the GPU details. In my example, I have added line breaks and highlighting to differentiate them. It won't be this clear for you, but they are typically added towards the end of the file, right after the mouse and keyboard input nodes.

You can see here that each 'hostdev' block equates to one of the devices you set up in the visual editor. There may be other devices if you've passed through anything else, but you can tell which is the GPU by looking for the physical address details in each block's 'source' node.

Essentially, what's going on here is this XML file is telling the VM engine to map those specific devices to virtual device addresses inside the VM. These virtual addresses are defined in the 'address' node at the end of each 'hostdev' block.

If you look carefully, you'll see that the default mapping is as follows;

Bus 9, Slot 0 Function 0 (video device) maps to Bus 5, Slot 0, Function 0
Bus 9, Slot 0 Function 1 (audio device) maps to Bus 6, Slot 0, Function 0
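In the libvirt XML, that default mapping looks something like this (a sketch using my example's addresses; other attributes are trimmed for clarity and may differ in your file):

      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
      </hostdev>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
      </hostdev>

The 'source' node holds the physical address; the trailing 'address' node holds the virtual one.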

So, whereas the physical device is connected to a common slot, the virtual device is split across different buses. We need to fix that so we have consistency;

Bus 9, Slot 0 Function 0 (video device) maps to Bus 5, Slot 0, Function 0
Bus 9, Slot 0 Function 1 (audio device) maps to Bus 5, Slot 0, Function 1

Here's the updated XML;
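In outline, the two corrected 'hostdev' blocks look like this (a sketch; your addresses and surrounding attributes will differ):

      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0' multifunction='on'/>
      </hostdev>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x1'/>
      </hostdev>

Both blocks now map to the same virtual bus and slot.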

The eagle-eyed may have noticed another change. We have also added multifunction='on' to the virtual video device's address. This tells the VM that this is a multifunction device and to expect further devices in the same slot.

When making these changes, it's critical that you take great care to preserve the data integrity of the XML file. A comma or quotation out of place will cause problems, so always check.

With those changes made, make and save a copy of the XML for future reference and click the 'Update' button. Remember, if you subsequently make changes to this VM via the GUI editor, these changes will be lost and you'll need to apply them again. For example, if you were to adjust the memory or CPU allocation of the VM, you'd need to go to your saved XML, copy the video portion and paste it back into the VM editor.

5. Adding GPU bios

Before progressing here, I will admit that I've rarely had an instance where adding a custom bios or not has made any difference to achieving GPU passthrough. I've implemented GPU passthrough across Windows and MacOS VMs, with AMD and nVidia cards, in cases where I had one GPU and multiple GPUs. I have added bios files, but never had a case where having or not having the file made any appreciable difference.

That's not to say it will not work for you, and people do report success with this step. I can also imagine scenarios where people may be using an already-flashed card, say with a mining-optimised bios, and want to have their VM use the standard or default bios.

In any case, I cannot write this up any better than the process is described in this video, so just do this;

6. Dealing with a black screen

Another very common problem is that the VM works fine with VNC, but as soon as the GPU is added, any screen attached will stay resolutely blank.

There can be many causes of this, and it's often a symptom of something else. The main problem is that you cannot see any progress or errors, so you've no idea what's going wrong. I've had instances where my Windows VM was dropping into recovery mode, but I could not see that. Problems like this are often resolved through the steps above, but if not, some additional troubleshooting is required.

In the first instance, it's important that you give yourself some means of accessing your VM other than VNC. VNC is great for setup, but when you add your GPU, this is removed, and you have no idea what's happening.

After VM setup and before adding a GPU, it's recommended to add some kind of remote access service to the VM to allow you to remote in, add drivers etc. DON'T use RDC on Windows for this, as it bypasses the GPU with a MS display driver, so is useless for our purposes. I recommend Splashtop, which has worked well for me and is free for up to 5 remote machines.

Once remote access is set up and tested with the VM in VNC mode, there is one further configuration I would recommend. Set the VM to auto-login. Splashtop won't expose the remote system for control until it's at the desktop, and if you cannot see the login screen to log in, it will be useless. On Windows, you do this with the netplwiz utility (search for it in Windows). This allows you to set a user to log in automatically, bypassing the login screen.

Once all that's done and confirmed working, you can switch from VNC to the passthrough GPU.

If you find you're staring at the dreaded blank screen, use the remote access tool to log in and explore Windows Device Manager to see if the GPU is listed, if there are any errors etc. It might be just a case of installing drivers, or dealing with whatever error is presented.

If, however, you cannot remote in, it's likely your VM has not booted to the desktop and is stuck somewhere in the boot process. You need to conduct further troubleshooting. For this, I'd recommend using the VIRT-Manager docker.

7. Using VIRT-Manager

VIRT-Manager is an alternative way to set up and configure virtual machines. It runs as a docker container. To get it installed, you'll need the Community Applications plug-in, then download VIRT-Manager from the Community Applications app store.

VIRT-Manager allows for finer control and access to additional features of VM configuration than that provided in the unRaid VM GUI. It accesses the same XML file that the unRaid UI does, so changes made in one will be reflected in the other.

For our purposes, VIRT-Manager is a great way to install a choice of virtual adapters alongside a pass-through GPU. This is a step up from remote access through Splashtop in that the virtual adapter is enabled from launch, allowing you to view the boot process within VIRT-Manager.

Once you have a VM up and running and a GPU assigned through unRaid, if you still have a blank screen, you can launch VIRT-Manager and click through to the VM's configuration screen (ensure the VM is stopped first).

Click 'Add Hardware' and choose Graphics to add a Spice or VNC adapter (you can easily change from one to the other afterwards). Once added, you will then have a Video device which can be adjusted to select from a range of virtual video devices. I've found VGA or QXL modes to work well.

You may need to adjust the display or video settings to find a combination that works well for you, but ultimately you should be able to boot the VM and observe its boot progress right in the VIRT-Manager browser window on the virtual console;

Now, your VM sees two displays and you can view and interact with the virtual display to see what's going on.

Note that you may need to further tweak the VM XML. If you add a display in this way, the ordering of the virtual display adapters in the virtual slots may catch you out. You might need to revert to the unRaid VM manager, edit the XML and look at the virtual slot assignments. Here is the XML for the above configuration;
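The exact XML will vary by setup, but the relevant nodes look something like this (a sketch; the ram/vram values and other attributes are illustrative):

      <video>
        <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'/>
        <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x0'/>
      </video>

      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0' multifunction='on'/>
      </hostdev>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x1'/>
      </hostdev>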

Again, I have separated the nodes out a little for clarity. The newly added virtual adapter comes first; the passthrough GPU is in the 'hostdev' nodes beneath.

Observe that the virtual adapter is configured on bus 9, slot 1, but the pass-through GPU is configured on bus 5, slot 0.

This means that the pass-through adapter is the first GPU by virtual slot sequence, and may be considered the primary display. To make the virtual adapter the primary, the bus id should be lower than that of the GPU or, if they are the same, the slot id should be lower. You can easily make this change to the XML to force the virtual adapter be the first found.
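For example, editing the virtual adapter's address node so it sits on a lower bus than the GPU (a sketch; choose a bus id that isn't already in use elsewhere in your XML):

      <video>
        <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'/>
        <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </video>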

8. nVidia Error (Code 43)

If everything else is working, but you have a very low resolution display in Windows and are using an nVidia card, check the card in Device Manager to see if it's reporting a Code 43 error. If so, it's likely that the nVidia drivers have detected that they are running in a virtual environment. nVidia really doesn't like running in a virtual environment, and will often fail with this Code 43 error in such circumstances.

To resolve it, you can add some additional information to the VM XML to hide from the OS the fact that it's virtualised.

Add the following edits to the <features> section of the XML. The 'relaxed', 'vapic' and 'spinlocks' lines will already be present in the standard unRaid config; the 'vendor_id' line, the 'kvm' block and the 'ioapic' line are the additions;

      <hyperv>
        <relaxed state='on'/>
        <vapic state='on'/>
        <spinlocks state='on' retries='8191'/>
        <vendor_id state='on' value='1234567890ab'/>
      </hyperv>
      <kvm>
        <hidden state='on'/>
      </kvm>
      <ioapic driver='kvm'/>

In these circumstances, you may also need to pass through a GPU bios (see step 5), and may need to use the OVMF bios type - some experimentation may be required.

9. Still not working?

If you've done all this and it's still not working, you need to start in-depth troubleshooting. As with all technical issues, it's best to reduce the problem to its minimum state. In this case, if you have any other devices passed through to the VM, or any other custom configuration, remove them. Try to deal with a single passthrough device and no other configs that might be getting in the way.

If you have multiple GPUs in your system, remove some, or swap slots. I had a strange issue for a time where, with two almost identical RX570 cards installed, one would fail to work in a particular slot, but when I swapped them, both worked.

It's always worth checking for motherboard bios updates, or resetting bios to factory. You can also try different versions of unRaid, or different versions of Q35 or i440 machine types in VM set up (though I usually find the latest is always the best to go for).

You might also have an issue with GPU compatibility on your target OS. For example, MacOS beyond High Sierra has very limited support for nVidia cards. Only a few older ones will work, and people trying to pass a 1660Ti to Catalina are often caught out by this. 'It works OK in a Windows VM' is the usual starting point.

If you've got this far, and there are still problems, make a detailed post in unRaid forums with as much information as possible about your system and the troubleshooting you've tried, and be sure to include a full diagnostics output (Tools -> Diagnostics). People there will be much more likely to assist if they see you've put in the effort off your own bat.

And finally, do persevere. You'll absorb all the above quite quickly and be spooling up VMs like a pro in no time. And it's all very much worthwhile.

Good luck!

Note: If you find any errors with the above, have further thoughts on what could be added or updated, have additional steps you think could be added or have a solution that worked for you, please leave a comment and I'll update the post.


Anonymous said...

Thanks a lot for complying such a complete passing through GPU guide!!
I found a lot of guides like missing puzzles floating around internet, but your guide is so far the most complete one I can found. Successfully passthrough my 1060 and RX570 to unraid with this guide!
Thanks a lot!

Anonymous said...

Thank you! My VM randomly broke after working for a while. I tried lots of things and none of them worked. The fix was to add the "hyperv" and "kvm" sections to the features, as you described in step 8. I just copy/pasted and it worked!