How KVM Uses Memory


This article is translated from "How KVM deals with memory".

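The qemu/kvm process runs like any ordinary Linux process and obtains its memory through ordinary malloc() or mmap() calls. If a guest needs 1GB of physical memory, qemu/kvm effectively calls malloc(1<<30) and allocates 1GB of host virtual memory for it. As with any ordinary process, however, no physical memory is allocated at the time of the malloc() call. The memory is not actually backed by physical pages until it is touched for the first time.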
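The lazy allocation described above can be observed from user space. The following is a minimal sketch, assuming a Linux host where /proc/self/statm is available. It uses an anonymous mmap() rather than malloc(), and the function names and the 64 MiB default size are illustrative choices, not anything taken from qemu/kvm.

```python
import mmap
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def resident_bytes() -> int:
    """Return this process's resident set size, read from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE


def demo_lazy_allocation(size: int = 64 * 1024 * 1024):
    """Map `size` bytes, then touch every page; return the RSS seen at each stage."""
    before = resident_bytes()
    # Anonymous private mapping: this reserves virtual address space only.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = resident_bytes()
    # Writing one byte per page forces the kernel to back each page with memory.
    for offset in range(0, size, PAGE_SIZE):
        region[offset] = 1
    after_touch = resident_bytes()
    region.close()
    return before, after_map, after_touch


if __name__ == "__main__":
    before, after_map, after_touch = demo_lazy_allocation()
    print(f"growth after mmap:  {(after_map - before) // 1024} KiB")
    print(f"growth after touch: {(after_touch - after_map) // 1024} KiB")
```

On a typical Linux system the resident size should barely move after the mapping is created and should grow by roughly the mapped size once every page has been written.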
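Once the guest is running, it treats the memory obtained from that malloc() call as its own physical memory. When the guest kernel accesses what it believes is physical address 0x00, it actually sees the first page of the memory that the qemu/kvm process obtained from malloc().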
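The relationship between a guest physical address and the host virtual address inside the qemu/kvm process is a simple offset. A minimal sketch follows, assuming a single contiguous region of guest RAM. The MemSlot class, its field names, and gpa_to_hva() are hypothetical names chosen for illustration and are not KVM's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSlot:
    """Hypothetical record of one guest RAM region backed by host virtual memory."""
    guest_phys_base: int  # first guest physical address covered by the region
    size: int             # length of the region in bytes
    host_virt_base: int   # address that malloc()/mmap() returned in qemu/kvm


def gpa_to_hva(slot: MemSlot, gpa: int) -> int:
    """Translate a guest physical address to the host virtual address backing it."""
    if not (slot.guest_phys_base <= gpa < slot.guest_phys_base + slot.size):
        raise ValueError(f"guest physical address {gpa:#x} is outside the slot")
    return slot.host_virt_base + (gpa - slot.guest_phys_base)


if __name__ == "__main__":
    # Assumed example values: 1GB of guest RAM starting at guest physical 0x0.
    slot = MemSlot(guest_phys_base=0x0, size=1 << 30, host_virt_base=0x7F0000000000)
    print(hex(gpa_to_hva(slot, 0x0)))
```

With these assumed values, guest physical address 0x00 lands on the first byte of the allocated region, which matches the description above.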
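In the past, the host had to be involved every time the guest changed its page tables. The host needed to verify that the entries the guest wrote into its page tables were valid and did not reach any memory the guest was not allowed to access. Two mechanisms made this verification possible.
First, the page tables actually used by the virtualization hardware are separate from the page tables the guest believes it is using. The guest first modifies its own page tables. The host then notices the change, verifies it, and only afterwards modifies the real page tables that the hardware walks. Guest software is never allowed to manipulate the hardware-walked page tables directly. This concept is called shadow page tables and is a very common technique in virtualization.
Second, the VMX/AMD-V extensions allow the host to trap whenever the guest tries to set the register that points to the base of the page tables (CR3).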
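The shadow page table idea can be illustrated with a toy model. This is a sketch under simplifying assumptions: page tables are flat dictionaries rather than multi-level structures, and the class and method names are hypothetical. It only shows the split between the table the guest edits and the table the hardware would walk.

```python
class ShadowPagingHost:
    """Toy model: the host validates guest page-table writes before hardware sees them."""

    def __init__(self, guest_to_host_frame):
        # Guest physical frame -> host physical frame, for frames the guest owns.
        self.guest_to_host_frame = dict(guest_to_host_frame)
        self.guest_table = {}   # guest's view: virtual page -> guest physical frame
        self.shadow_table = {}  # hardware's view: virtual page -> host physical frame

    def guest_writes_pte(self, virt_page: int, guest_frame: int) -> bool:
        """Model a guest page-table write that the host intercepts."""
        # The guest modifies only its own page table.
        self.guest_table[virt_page] = guest_frame
        # The host verifies the entry before touching the shadow table.
        if guest_frame not in self.guest_to_host_frame:
            self.shadow_table.pop(virt_page, None)  # refuse to map foreign memory
            return False
        self.shadow_table[virt_page] = self.guest_to_host_frame[guest_frame]
        return True


if __name__ == "__main__":
    host = ShadowPagingHost({0: 100, 1: 101})
    print(host.guest_writes_pte(virt_page=7, guest_frame=1))   # valid entry
    print(host.guest_writes_pte(virt_page=8, guest_frame=42))  # rejected entry
    print(host.shadow_table)
```

The point of the design is that an invalid guest entry never reaches the table the hardware uses, at the price of host involvement on every change.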
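This technique works well, but it has some serious performance implications.
A single access to a guest page can take up to 25 memory accesses to complete, which is very costly. See http://developer.amd.com/assets/NPT-WP-1%201-final-TM.pdf for more information. The basic problem is that every memory access must walk the guest's page tables and then the host's page tables. The two-dimensional part arises because the guest's page tables themselves live in guest physical memory, so every reference to them must in turn be translated through the host's page tables. It can also be very costly for the host to verify and maintain a set of shadow page tables.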
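One way to arrive at the figure of 25 is to count the references in a two-dimensional walk. The sketch below assumes four page-table levels on both the guest and the host side and no caching of any translation. The accounting is an illustrative assumption, not a calculation quoted from the article.

```python
def nested_walk_references(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Count page-table references in a two-dimensional (guest x host) walk."""
    # For each guest level, the guest-physical address of the guest table entry
    # is translated through every host level, then the guest entry itself is read.
    per_guest_level = host_levels + 1
    # The final guest physical address of the data needs one more host walk.
    return guest_levels * per_guest_level + host_levels


def total_memory_accesses(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Page-table references plus the access to the data itself."""
    return nested_walk_references(guest_levels, host_levels) + 1


if __name__ == "__main__":
    print(nested_walk_references())  # page-table references only
    print(total_memory_accesses())   # including the final data access
```

Under these assumptions the walk costs 4 * (4 + 1) + 4 = 24 page-table references, and the data access itself brings the total to 25.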
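Both AMD and Intel therefore looked for a solution to this problem and came up with similar answers, called NPT (AMD) and EPT (Intel). They specify a set of data structures recognized by the hardware that can quickly translate guest physical addresses to host physical addresses without going through the host page tables. This shortcut removes the need for the costly two-dimensional page table walks.
The problem this introduces is that the host page tables are what we use to enforce things like process separation. If a page is to be unmapped from the host (for example, when it is swapped out), that change must be propagated to the hardware EPT/NPT structures in a coordinated and timely way.
The software solution in Linux is called mmu_notifiers. Since qemu/KVM memory is ordinary Linux memory from the host kernel's point of view, the kernel may try to swap it, replace it, or even free it, just like any other memory.
However, before a page is actually handed back to the host kernel for other use, the host notifies the KVM/qemu guest of its intentions. The KVM/qemu guest can then remove the page from the shadow page tables or from the NPT/EPT structures. Once the KVM/qemu guest has done so, the host kernel is free to do whatever it wishes with the page.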

A day in the life of a KVM guest physical page:

Fault-in path

1. QEMU calls malloc() and allocates virtual space for the page, but no backing physical page
2. The guest process touches what it thinks is a physical address, but this traps into the host since the memory is unallocated
3. The host kernel sees a page fault, calls do_page_fault() in the area that was malloc()'d, and if all goes well, allocates some memory to back it.
4. The host kernel creates a pte_t to connect the malloc()'d virtual address to a host physical address, makes rmap entries, puts it on the LRU, etc...
5. mmu_notifier change_pte()?? is called, which allows KVM to create an NPT/EPT entry for the new page. (and an spte entry??)
6. Host returns from page fault, guest execution resumes
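The fault-in path above can be sketched as a small simulation. This is a toy model under stated assumptions: page tables are flat dictionaries, host frames are handed out from a simple counter, and every name (FaultInModel, guest_access, and so on) is hypothetical rather than taken from the kernel or KVM.

```python
class FaultInModel:
    """Toy model of demand paging for guest memory backed by host virtual memory."""

    def __init__(self, guest_pages: int):
        self.guest_pages = guest_pages  # virtual space reserved by malloc()
        self.host_pte = {}              # host virtual page -> host frame (pte_t)
        self.ept = {}                   # guest physical page -> host frame (NPT/EPT)
        self.next_free_frame = 0
        self.faults = 0

    def guest_access(self, gpa_page: int) -> int:
        """The guest touches a page; fault it in on first use and return the frame."""
        if not 0 <= gpa_page < self.guest_pages:
            raise IndexError("address outside guest memory")
        if gpa_page not in self.ept:
            self._handle_fault(gpa_page)  # trap into the host
        return self.ept[gpa_page]         # guest execution resumes

    def _handle_fault(self, gpa_page: int) -> None:
        self.faults += 1
        frame = self.next_free_frame      # host allocates memory to back the page
        self.next_free_frame += 1
        self.host_pte[gpa_page] = frame   # host creates the pte_t
        self.ept[gpa_page] = frame        # KVM creates the NPT/EPT entry


if __name__ == "__main__":
    model = FaultInModel(guest_pages=4)
    model.guest_access(2)
    model.guest_access(2)
    print(model.faults, model.ept)
```

Only the first access to a page takes the slow path through the host; later accesses are served straight from the NPT/EPT entry.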

Swap-out path

Now, let's say the host is under memory pressure. The page from above has gone through the Linux LRU and has found itself on the inactive list. The kernel decides that it wants the page back:

1. The host kernel uses rmap structures to find out in which VMA (vm_area_struct) the page is mapped.
2. The host kernel looks up the mm_struct associated with that VMA, and walks down the Linux page tables to find the host hardware page table entry (pte_t) for the page.
3. The host kernel swaps out the page and clears out the pte_t (let's assume that this page was only used in a single place). But, before freeing the page:
4. The host kernel calls the mmu_notifier invalidate_page(). This looks up the page's entry in the NPT/EPT structures and removes it.
5. Now, any subsequent access to the page will trap into the host (step 2 in the fault-in path above)
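The ordering in the swap-out path matters: the notifier runs after the pte_t is cleared but before the page is freed. A minimal sketch of that ordering follows, assuming a simple list of callbacks in place of the kernel's mmu_notifier machinery. All names here are hypothetical, and the event log exists only to make the order visible.

```python
class SwapOutModel:
    """Toy model of reclaiming a guest page with a notifier callback."""

    def __init__(self):
        self.host_pte = {}     # host virtual page -> host frame (pte_t)
        self.ept = {}          # guest physical page -> host frame (NPT/EPT)
        self.notifiers = []    # callbacks invoked before a page is freed
        self.free_frames = []  # frames returned to the host
        self.log = []          # ordered record of what happened

    def register_notifier(self, callback) -> None:
        self.notifiers.append(callback)

    def map_page(self, page: int, frame: int) -> None:
        self.host_pte[page] = frame
        self.ept[page] = frame

    def kvm_invalidate_page(self, page: int) -> None:
        """KVM's side of the notifier: drop the NPT/EPT entry for the page."""
        self.ept.pop(page, None)
        self.log.append(("invalidate", page))

    def swap_out(self, page: int) -> None:
        frame = self.host_pte.pop(page)  # clear the pte_t
        self.log.append(("clear_pte", page))
        for callback in self.notifiers:  # notify before freeing the page
            callback(page)
        self.free_frames.append(frame)   # only now is the frame reusable
        self.log.append(("free", frame))


if __name__ == "__main__":
    model = SwapOutModel()
    model.register_notifier(model.kvm_invalidate_page)
    model.map_page(page=3, frame=9)
    model.swap_out(3)
    print(model.log)
```

Because the NPT/EPT entry is gone by the time the frame is reused, the guest can never reach memory that the host has already given to someone else.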

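Given all of the above, it should be apparent that, just like ordinary processes on Linux, the host memory allocated to the host processes that represent KVM guests may be overcommitted.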