Understanding the Linux Virtual File System (VFS): Data Structures


file_system_type

struct file_system_type {
        const char *name;
        int fs_flags;
        int (*get_sb) (struct file_system_type *, int,
                       const char *, void *, struct vfsmount *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct list_head fs_supers;

        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;

        struct lock_class_key i_lock_key;
        struct lock_class_key i_mutex_key;
        struct lock_class_key i_mutex_dir_key;
        struct lock_class_key i_alloc_sem_key;
};

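The kernel keeps a global variable file_systems, a pointer to file_system_type that heads the list of every filesystem registered with the system. When a filesystem is mounted, the kernel searches this list to decide whether the requested filesystem type is supported.

@name: The name of the filesystem. The name uniquely identifies a filesystem type.

@next: Links all registered filesystem types into the file_systems list.

@fs_supers: For every mounted filesystem the kernel creates a super_block structure, which stores information about the filesystem itself and about its mount point. Because several filesystems of the same type can be mounted at once (for example, / and /home both mounted as ext3), one filesystem type can correspond to several super blocks. @fs_supers links together all super blocks that belong to this filesystem type.

@owner: A pointer to the owning module. It is only meaningful when the filesystem type is registered from a module.

@get_sb: This function is very important: it is the entry point through which VFS starts to interact with the underlying filesystem. It cannot live in the super_block structure, because the super_block only exists after get_sb has run. get_sb obtains the super_block information from the underlying filesystem, so its implementation is filesystem specific.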
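The registration list described above can be modelled in user space. The following is only a minimal sketch: fs_type, fs_list, find_fs, register_fs and lookup_fs are names invented for illustration, and the structure keeps just the name and next members. It shows how a singly linked list is enough both to reject a duplicate name at registration time and to answer the "is this type supported?" question at mount time.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for file_system_type: only the two members
 * needed to show the registration list. */
struct fs_type {
    const char *name;
    struct fs_type *next;
};

/* Head of the singly linked list, playing the role of file_systems. */
static struct fs_type *fs_list;

/* Return the address of the link that points at the entry called `name`,
 * or the address of the terminating NULL link if there is no such entry. */
static struct fs_type **find_fs(const char *name)
{
    struct fs_type **p;

    for (p = &fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append `fs` to the list. Returns 0 on success, -1 if the structure is
 * already linked or another entry already uses the same name. */
static int register_fs(struct fs_type *fs)
{
    struct fs_type **p;

    if (fs->next)
        return -1;
    p = find_fs(fs->name);
    if (*p)
        return -1;
    *p = fs;
    return 0;
}

/* Lookup as done at mount time: NULL means "type not supported". */
static struct fs_type *lookup_fs(const char *name)
{
    return *find_fs(name);
}
```

Returning the address of the link rather than the entry itself lets registration reuse the lookup: if the name is not found, the returned link is exactly the place where the new entry must be stored.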

super_block

When a filesystem is mounted, the kernel not only creates a vfsmount structure in memory, it also reads and creates the super block of that filesystem.

struct super_block {
        struct list_head        s_list;         /* Keep this first */
        dev_t                   s_dev;          /* search index; _not_ kdev_t */
        unsigned long           s_blocksize;
        unsigned char           s_blocksize_bits;
        unsigned char           s_dirt;
        unsigned long long      s_maxbytes;     /* Max file size */
        struct file_system_type *s_type;
        const struct super_operations   *s_op;
        struct dquot_operations *dq_op;
        struct quotactl_ops     *s_qcop;
        const struct export_operations *s_export_op;
        unsigned long           s_flags;
        unsigned long           s_magic;
        struct dentry           *s_root;
        struct rw_semaphore     s_umount;
        struct mutex            s_lock;
        int                     s_count;
        int                     s_syncing;
        int                     s_need_sync_fs;
        atomic_t                s_active;
#ifdef CONFIG_SECURITY
        void                    *s_security;
#endif
        struct xattr_handler    **s_xattr;

        struct list_head        s_inodes;       /* all inodes */
        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_io;           /* parked for writeback */
        struct list_head        s_more_io;      /* parked for more writeback */
        struct hlist_head       s_anon;         /* anonymous dentries for (nfs) exporting */
        struct list_head        s_files;

        struct block_device     *s_bdev;
        struct mtd_info         *s_mtd;
        struct list_head        s_instances;
        struct quota_info       s_dquot;        /* Diskquota specific options */

        int                     s_frozen;
        wait_queue_head_t       s_wait_unfrozen;

        char s_id[32];                          /* Informational name */

        void                    *s_fs_info;     /* Filesystem private info */

        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct mutex s_vfs_rename_mutex;        /* Kludge */

        /* Granularity of c/m/atime in ns.
           Cannot be worse than a second */
        u32                s_time_gran;

        /*
         * Filesystem subtype.  If non-empty the filesystem type field
         * in /proc/mounts will be "type.subtype"
         */
        char *s_subtype;
};

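@s_list: All super_block structures are linked together through s_list.

@s_dev, @s_bdev: Identify the block device that holds the filesystem. The former is the device number, while @s_bdev points to the in-memory block_device structure.

@s_blocksize, @s_blocksize_bits: Describe the block size of the filesystem. The two members express the same thing in different ways: the former is the size in bytes, the latter is the base-2 logarithm of that size. s_blocksize_bits is easily computed from s_blocksize, so I do not see a strong reason for storing both. The only explanation I can think of is that there are few super_block instances, so one extra member per structure (s_blocksize_bits is a single unsigned char) costs practically nothing.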
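One possible reason for caching the logarithm, offered here as an assumption rather than something the structure definition states, is that converting a byte offset into a block number becomes a shift instead of a division. The sketch below illustrates the relationship between the two values; blocksize_bits, block_of and offset_in_block are hypothetical helper names, and the block size is assumed to be a power of two.

```c
#include <stdint.h>

/* Derive the bit count from a power-of-two block size,
 * e.g. 4096 bytes gives 12 bits. */
static unsigned char blocksize_bits(unsigned long size)
{
    unsigned char bits = 0;

    while ((1UL << bits) < size)
        bits++;
    return bits;
}

/* Index of the block that contains byte offset `pos`. */
static uint64_t block_of(uint64_t pos, unsigned char bits)
{
    return pos >> bits;
}

/* Position of byte offset `pos` inside its block. */
static uint64_t offset_in_block(uint64_t pos, unsigned char bits)
{
    return pos & ((UINT64_C(1) << bits) - 1);
}
```

With a 4096-byte block, byte offset 10000 falls into block 2 at offset 1808, and both values are obtained without dividing.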

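@s_dirt: Indicates that changes to the in-memory super_block need to be written back to the on-disk super block.

@s_maxbytes: The upper limit on file size supported by the filesystem. This is filesystem specific; different filesystems have different limits.

@s_type: The filesystem type; it points to one node of the file_systems list.

@s_op: The operations associated with the super_block. They are described under super_operations.

@s_flags: The mount flags of the filesystem, derived from the flags passed in when the filesystem is mounted.

@s_magic: Every filesystem type has its own magic number.

@s_root: The dentry of the root directory of this filesystem instance.

@s_count: The reference count of the super block. The super block is destroyed when it drops to 0.

@s_syncing: Indicates that the kernel is currently synchronizing dirty inodes back to disk.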

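@s_inodes: All inodes of this filesystem instance are attached, through their i_sb_list member, to the doubly linked list headed by s_inodes.

@s_dirty: All dirty inodes of this filesystem instance are attached, through their i_list member, to the doubly linked list headed by s_dirty. This list makes synchronizing in-memory data to the underlying storage more efficient: it contains only the modified inodes, so there is no need to scan every inode.

@s_io: This list holds all inodes of the filesystem instance that the synchronization code is currently considering for writeback.

@s_more_io: Contains inodes that have already been selected for synchronization and were placed on s_io, but could not be processed in one pass.

@s_files: The list of all open file objects of this filesystem instance, attached through file->f_list. The kernel consults this list when unmounting the filesystem: if the list is not empty, the filesystem is still in use and the unmount fails.

@s_instances: The filesystem instance is linked through its s_instances member onto the fs_supers list of its file_system_type. In other words, for a given filesystem type, the fs_supers member of file_system_type is a list on which all instances of that type are attached.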
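The pairing of a list head in one structure (fs_supers) with a link member in another (s_instances) is the usual intrusive-list pattern. The following user-space sketch imitates it under simplifying assumptions: list_head, list_add_tail and container_of are re-implemented here in reduced form, and fs_type, sb, count_instances and first_instance are illustrative names, not kernel definitions.

```c
#include <stddef.h>

/* Reduced re-implementation of a circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its link member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Stand-ins for file_system_type and super_block. */
struct fs_type {
    const char *name;
    struct list_head fs_supers;    /* head: all instances of this type */
};

struct sb {
    const char *dev;
    struct list_head s_instances;  /* link: this instance's entry */
};

/* Count the instances attached to a filesystem type. */
static int count_instances(const struct fs_type *fs)
{
    const struct list_head *p;
    int n = 0;

    for (p = fs->fs_supers.next; p != &fs->fs_supers; p = p->next)
        n++;
    return n;
}

/* Return the first attached instance, or NULL if there is none. */
static struct sb *first_instance(struct fs_type *fs)
{
    if (fs->fs_supers.next == &fs->fs_supers)
        return NULL;
    return container_of(fs->fs_supers.next, struct sb, s_instances);
}
```

Because the link lives inside the instance, attaching a super block to its type needs no extra allocation, and the same instance can sit on several lists at once by embedding several link members.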

@s_fs_info: A pointer to private data of the filesystem implementation. VFS does not touch this data; it is managed by the underlying filesystem.

inode

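An inode contains all the information about one file in a filesystem, and inodes correspond one to one with the files in the filesystem. With the information held in the inode, the filesystem can operate on the file conveniently.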

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_list;
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        unsigned long           i_version;
        loff_t                  i_size;
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        unsigned int            i_blkbits;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        umode_t                 i_mode;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        const struct inode_operations   *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };
        int                     i_cindex;
        __u32                   i_generation;
        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */
        unsigned int            i_flags;
        atomic_t                i_writecount;
        void                    *i_private; /* fs or device private pointer */
};

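First come four list members (one hlist_node and three list_head). Through them, every inode takes part in four lists at the same time.

@i_hash: All inodes in the system are stored in a hash table so that a particular inode can be found quickly. The hash is computed from the super block and the inode number; together these two values identify an inode uniquely. i_hash is used to chain inodes whose hash values collide.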
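To make the role of i_hash concrete, here is a small chained hash table keyed by the pair (super block, inode number). It is a sketch only: the hash function is an arbitrary mix chosen for the example and is not the kernel's, and sb, ino_entry, hash_insert and hash_lookup are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BUCKETS 64   /* arbitrary table size for the sketch */

struct sb {
    int id;
};

struct ino_entry {
    struct sb *sb;                /* owning filesystem instance */
    unsigned long ino;            /* inode number, unique within sb */
    struct ino_entry *hash_next;  /* collision chain, the role of i_hash */
};

static struct ino_entry *buckets[HASH_BUCKETS];

/* Mix the super block address and the inode number into a bucket index. */
static unsigned int hash_key(const struct sb *sb, unsigned long ino)
{
    uintptr_t h = (uintptr_t)sb >> 4;

    h ^= ino * 2654435761UL;
    return (unsigned int)(h % HASH_BUCKETS);
}

/* Put an entry at the front of its bucket's chain. */
static void hash_insert(struct ino_entry *e)
{
    unsigned int b = hash_key(e->sb, e->ino);

    e->hash_next = buckets[b];
    buckets[b] = e;
}

/* Walk the chain and compare both parts of the key. */
static struct ino_entry *hash_lookup(const struct sb *sb, unsigned long ino)
{
    struct ino_entry *e;

    for (e = buckets[hash_key(sb, ino)]; e; e = e->hash_next)
        if (e->sb == sb && e->ino == ino)
            return e;
    return NULL;
}
```

The lookup compares the super block as well as the inode number because the same inode number can legitimately appear in two different filesystem instances.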

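@i_list: This member attaches the inode to one of several lists. The states are mutually exclusive, so they share this one member; which list the inode is on depends on its current state.

The main states are:

1. The inode is in memory but is not associated with any file, so it is inactive.

2. The inode is in memory and is being used by one or more processes. The file data and metadata in memory are in sync with the data and metadata on disk, that is, neither has been modified since the last synchronization.

3. The inode is in memory and is being used by one or more processes, but its file data or metadata is dirty and must be synchronized to disk.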

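@i_sb_list: A file always belongs to some filesystem instance, and the super_block can be regarded as the representative of that instance. All inodes of the instance are attached through i_sb_list to the s_inodes list of the super_block.

@i_dentry: One inode can correspond to several directory entries; these dentries are linked onto i_dentry through their d_alias member. As far as I know, hard links are what cause one inode to have several directory entries.

@i_ino: The inode number, unique within a filesystem instance. The inode number matters a great deal to the underlying filesystem and runs through almost all of its management code.

@i_count: The usage count of the inode. The file that the inode represents can be accessed by several processes at the same time.

@i_nlink: The link count of the file, that is, the number of hard links. For a regular file it is the number of hard links created for that file. For a directory it reflects the number of subdirectories (on traditional Unix filesystems it is 2 plus the number of subdirectories, because of the directory's own "." entry and each child's ".." entry).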

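@i_rdev: A device number. For an inode that represents a device special file, it identifies the device that the file stands for, and through this number the target block_device can eventually be obtained. (The device that holds the filesystem itself is recorded in the super block's s_dev.)

@i_size: The length of the file, in bytes.

@i_blkbits: The block size expressed as a number of bits. Early 2.6 kernels also had an inode member called i_blksize, which took 4 bytes. The two carried redundant information; when the system holds a large number of inodes that redundancy can consume a lot of memory, and the block size is trivial to compute from i_blkbits, so i_blksize was removed in later versions.

@i_blocks: The total number of blocks used by the file. Note that i_blocks cannot be derived from i_size and i_blkbits, because the file described by i_size may be sparse. See how the source code uses i_blocks for details.

@i_bytes: The number of bytes used in the file's last block. Again, because the file may contain holes, i_bytes cannot be derived from i_size and i_blkbits.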
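How i_blocks and i_bytes work together can be shown with a small model. It rests on one assumption that should be checked against the kernel source: that usage is accounted in 512-byte units, with i_bytes holding the remainder below 512, which is how I understand the kernel's inode_add_bytes helper to behave. The names blk_usage, usage_add_bytes and usage_total_bytes are invented for this sketch.

```c
#include <stdint.h>

/* Model of the two accounting members of an inode. */
struct blk_usage {
    uint64_t i_blocks;       /* whole 512-byte units in use */
    unsigned short i_bytes;  /* remainder, always below 512 */
};

/* Account for `bytes` more bytes of allocated storage. */
static void usage_add_bytes(struct blk_usage *u, uint64_t bytes)
{
    u->i_blocks += bytes >> 9;                   /* whole units */
    u->i_bytes += (unsigned short)(bytes & 511); /* leftover bytes */
    if (u->i_bytes >= 512) {                     /* carry into i_blocks */
        u->i_blocks++;
        u->i_bytes -= 512;
    }
}

/* Total allocated bytes represented by the pair. */
static uint64_t usage_total_bytes(const struct blk_usage *u)
{
    return (u->i_blocks << 9) + u->i_bytes;
}
```

The pair tracks storage actually allocated, not the logical length, which is why neither value follows from i_size when a file has holes.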

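@i_atime, @i_ctime, @i_mtime: Respectively, the time of the last access to the file, the time of the last change to the inode itself (ownership, permissions, link count and so on; this is not the creation time), and the time of the last modification of the file contents.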

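@i_op, @i_fop: The former holds operations specific to the inode; the latter mainly holds operations on the data inside the file.

@i_sb: The super block that the file belongs to.

@i_data, @i_mapping: The address space object. Every inode is associated with an address space object, which manages the page mappings related to the file. address_space is one of the core abstractions of the kernel.

@i_devices: Related to device file management. One device may be represented by several device files; this member links the inodes of those device files together.

@i_writecount: A usage count of the processes that have the file open for writing.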

dentry

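The directory structure of a disk filesystem is stored in on-disk directory entries, and block devices are slow to read, so finding the inode that corresponds to a file name can take a long time. Linux therefore introduces the dentry cache to reuse the results of earlier lookups.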

struct dentry {
        atomic_t d_count;
        unsigned int d_flags;           /* protected by d_lock */
        spinlock_t d_lock;              /* per dentry lock */
        struct inode *d_inode;          /* Where the name belongs to - NULL is
                                         * negative */
        /*
         * The next three fields are touched by __d_lookup.  Place them here
         * so they all fit in a cache line.
         */
        struct hlist_node d_hash;       /* lookup hash list */
        struct dentry *d_parent;        /* parent directory */
        struct qstr d_name;
        struct list_head d_lru;         /* LRU list */
        /*
         * d_child and d_rcu can share memory
         */
        union {
                struct list_head d_child;       /* child of parent list */
                struct rcu_head d_rcu;
        } d_u;
        struct list_head d_subdirs;     /* our children */
        struct list_head d_alias;       /* inode alias list */
        unsigned long d_time;           /* used by d_revalidate */
        struct dentry_operations *d_op;
        struct super_block *d_sb;       /* The root of the dentry tree */
        void *d_fsdata;                 /* fs-specific data */
        int d_mounted;
        unsigned char d_iname[DNAME_INLINE_LEN_MIN];    /* small names */
};

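@d_count: The reference count of the dentry. Creating a child dentry increments the reference count of the parent dentry, and associating an inode with a dentry also increments the dentry's reference count.

@d_inode: The inode that the dentry refers to. This is the most important member of a dentry, because the main purpose of a dentry is to find an inode from a path name.

@d_hash: All dentries in memory are stored in a hash table; d_hash chains dentries whose hash values collide.

@d_parent: Points to the parent dentry of this dentry. For the root directory, d_parent points to the dentry itself.
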
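The convention that the root's d_parent points to itself gives a natural termination condition when walking upward. As an illustration, the sketch below rebuilds an absolute path by following parent pointers until a node is its own parent. It ignores mount points and locking entirely, and dent and dent_path are hypothetical names, not the kernel's path-building code.

```c
#include <stddef.h>
#include <string.h>

/* Reduced dentry: one path component plus a parent pointer. */
struct dent {
    const char *name;
    struct dent *parent;   /* the root points to itself */
};

/* Write the absolute path of `d` into buf. The path is assembled from
 * the end of the buffer backwards, one component per step.
 * Returns 0 on success, -1 if the buffer is too small. */
static int dent_path(const struct dent *d, char *buf, size_t len)
{
    char *end = buf + len;
    char *p = end;

    if (len < 2)
        return -1;
    *--p = '\0';
    if (d->parent == d)            /* the root itself is just "/" */
        *--p = '/';
    while (d->parent != d) {       /* stop when a node is its own parent */
        size_t n = strlen(d->name);

        if ((size_t)(p - buf) < n + 1)
            return -1;
        p -= n;
        memcpy(p, d->name, n);
        *--p = '/';
        d = d->parent;
    }
    memmove(buf, p, (size_t)(end - p));
    return 0;
}
```

For a chain root, home, user the function produces "/home/user". Building from the back avoids having to know the depth of the tree in advance.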

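@d_name: Holds the name of the file. qstr is a wrapper that stores the length of the string, its hash value and the string itself. The string is not an absolute path, only the current path component. If the name is shorter than DNAME_INLINE_LEN_MIN, d_name.name points to d_iname; otherwise the storage has to be allocated with kmalloc.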
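The inline-name optimisation can be sketched in user space as follows. This is an illustration under assumptions: INLINE_LEN is an arbitrary value standing in for DNAME_INLINE_LEN_MIN, malloc stands in for kmalloc, the hash is a simple multiplicative one rather than the kernel's, and name_ref, dent, dent_set_name, dent_name_is_inline and dent_release_name are made-up names.

```c
#include <stdlib.h>
#include <string.h>

#define INLINE_LEN 16   /* arbitrary; stands in for DNAME_INLINE_LEN_MIN */

/* Wrapper in the spirit of qstr: hash, length and the string itself. */
struct name_ref {
    unsigned int hash;
    unsigned int len;
    const char *name;
};

struct dent {
    struct name_ref d_name;
    char d_iname[INLINE_LEN];   /* storage for short names */
};

/* Simple multiplicative string hash (not the kernel's function). */
static unsigned int name_hash(const char *s, size_t len)
{
    unsigned int h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * 31u + (unsigned char)s[i];
    return h;
}

/* Store one path component: inline if it fits, on the heap otherwise.
 * Returns 0 on success, -1 if the allocation fails. */
static int dent_set_name(struct dent *d, const char *s)
{
    size_t len = strlen(s);
    char *dst = d->d_iname;

    if (len >= INLINE_LEN) {
        dst = malloc(len + 1);
        if (!dst)
            return -1;
    }
    memcpy(dst, s, len + 1);
    d->d_name.name = dst;
    d->d_name.len = (unsigned int)len;
    d->d_name.hash = name_hash(s, len);
    return 0;
}

static int dent_name_is_inline(const struct dent *d)
{
    return d->d_name.name == d->d_iname;
}

/* Free the heap copy, if any, and clear the name pointer. */
static void dent_release_name(struct dent *d)
{
    if (!dent_name_is_inline(d))
        free((void *)d->d_name.name);
    d->d_name.name = NULL;
}
```

Most file names are short, so keeping a small buffer inside the structure spares an allocation for the common case while still allowing arbitrarily long names.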

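@d_lru: The list head is dentry_unused. Every dentry whose reference count has dropped to 0 is placed on this LRU list, inserted at the front, so the closer a node is to the tail, the older it is.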
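The ordering rule of the LRU list (insert at the front, reclaim from the tail) can be demonstrated with a few lines of list code. The sketch is a simplified model, not the kernel's pruning logic: the list helpers are reduced re-implementations, and dent, unused, dent_put and lru_evict_oldest are names chosen for the example.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct dent {
    int id;
    int d_count;              /* reference count */
    struct list_head d_lru;   /* link on the unused list */
};

/* Head of the unused list, the role played by dentry_unused. */
static struct list_head unused = { &unused, &unused };

/* Insert `n` right after the head, i.e. at the front of the list. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
}

/* Drop one reference; park the entry on the LRU list at zero. */
static void dent_put(struct dent *d)
{
    if (--d->d_count == 0)
        list_add(&d->d_lru, &unused);
}

/* Reclaim the oldest unused entry, which sits at the tail. */
static struct dent *lru_evict_oldest(void)
{
    struct list_head *victim = unused.prev;

    if (victim == &unused)
        return NULL;
    list_del(victim);
    return container_of(victim, struct dent, d_lru);
}
```

If three entries are released in the order a, b, c, eviction returns them in the same order, because each newer entry pushes the older ones toward the tail.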

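@d_child: Links the current dentry onto the d_subdirs list of its parent dentry.

@d_subdirs: All child dentries are linked through their d_child member onto the d_subdirs list of their parent.

@d_sb: Points to the super block of the filesystem instance that the directory entry belongs to.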

file

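Although file also appears to represent a file, keep in mind that a file structure is tied to a process, whereas an inode is independent of any process. A file therefore has only one inode in the system, but it may have several file structures in the system, belonging to different processes.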

struct file {
        struct path             f_path;
#define f_dentry        f_path.dentry
#define f_vfsmnt        f_path.mnt
        const struct file_operations    *f_op;
        atomic_t                f_count;
        unsigned int            f_flags;
        mode_t                  f_mode;
        loff_t                  f_pos;
        struct fown_struct      f_owner;
        unsigned int            f_uid, f_gid;
        struct file_ra_state    f_ra;
        u64                     f_version;
#ifdef CONFIG_SECURITY
        void                    *f_security;
#endif
        /* needed for tty driver, and maybe others */
        void                    *private_data;
#ifdef CONFIG_EPOLL
        /* Used by fs/eventpoll.c to link all the hooks to this file */
        struct list_head        f_ep_links;
        spinlock_t              f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
        struct address_space    *f_mapping;
};

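@f_count: The reference count, i.e. the number of processes using the file object. For example, processes created with CLONE_FILES share their open files and therefore use the same file objects.

@f_mode: The mode argument passed when the file was opened is stored in f_mode.

@f_flags: The open flags passed when the file was opened are stored in f_flags.

@f_pos: A very important value: the current read/write position in the file. It cannot be stored in the inode, because one inode may correspond to several open file structures.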
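The per-file-object nature of the position can be observed from user space. The sketch below assumes a POSIX system with a writable /tmp; offsets_demo is a hypothetical helper written for this illustration. It opens the same file twice, so that two file objects refer to one inode, and also duplicates the first descriptor with dup(), so that two descriptors refer to one file object. It then reads through the first descriptor and reports where the other two stand.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read 4 bytes through one descriptor and report the positions seen
 * through a second open() of the same file and through a dup() of the
 * first descriptor. Returns 0 on success, -1 on any system call error. */
static int offsets_demo(off_t *second_open, off_t *duplicate)
{
    char path[] = "/tmp/fpos-demo-XXXXXX";
    char buf[4];
    int fd1, fd2, fd3;
    int rc = -1;

    fd1 = mkstemp(path);              /* first file object */
    if (fd1 < 0)
        return -1;
    if (write(fd1, "abcdefgh", 8) != 8)
        goto out1;
    if (lseek(fd1, 0, SEEK_SET) != 0)
        goto out1;
    fd2 = open(path, O_RDONLY);       /* second file object, same inode */
    if (fd2 < 0)
        goto out1;
    fd3 = dup(fd1);                   /* same file object as fd1 */
    if (fd3 < 0)
        goto out2;
    if (read(fd1, buf, 4) != 4)       /* advances the position of fd1 */
        goto out3;
    *second_open = lseek(fd2, 0, SEEK_CUR);
    *duplicate = lseek(fd3, 0, SEEK_CUR);
    rc = 0;
out3:
    close(fd3);
out2:
    close(fd2);
out1:
    close(fd1);
    unlink(path);
    return rc;
}
```

After the read, the descriptor obtained from the second open() is still at position 0 while the duplicated descriptor has moved to 4. The position follows the file object, not the inode and not the descriptor number.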

@f_dentry: Provides the association between the open file and the inode of that file.

@f_vfsmnt: Information about the filesystem the file resides on.

vfsmount

Every mounted filesystem corresponds to one vfsmount instance.

struct vfsmount {
        struct list_head mnt_hash;
        struct vfsmount *mnt_parent;    /* fs we are mounted on */
        struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
        struct dentry *mnt_root;        /* root of the mounted tree */
        struct super_block *mnt_sb;     /* pointer to superblock */
        struct list_head mnt_mounts;    /* list of children, anchored here */
        struct list_head mnt_child;     /* and going through their mnt_child */
        int mnt_flags;
        /* 4 bytes hole on 64bits arches */
        char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
        struct list_head mnt_list;
        struct list_head mnt_expire;    /* link in fs-specific expiry list */
        struct list_head mnt_share;     /* circular list of shared mounts */
        struct list_head mnt_slave_list;/* list of slave mounts */
        struct list_head mnt_slave;     /* slave list entry */
        struct vfsmount *mnt_master;    /* slave is on master->mnt_slave_list */
        struct mnt_namespace *mnt_ns;   /* containing namespace */
        /*
         * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
         * to let these frequently modified fields in a separate cache line
         * (so that reads of mnt_flags wont ping-pong on SMP machines)
         */
        atomic_t mnt_count;
        int mnt_expiry_mark;            /* true if marked for expiry */
        int mnt_pinned;
};

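@mnt_mountpoint: The dentry, in the parent filesystem, of the mount point of the current filesystem.

@mnt_root: The dentry of the root directory of the current filesystem.

@mnt_parent: Points to the vfsmount of the parent filesystem.

@mnt_sb: Points to the super block associated with this vfsmount. Every mounted filesystem has exactly one super block instance.

@mnt_mounts: A list head; it heads the list of child filesystems.

@mnt_child: A vfsmount is attached through mnt_child to the mnt_mounts list of its parent vfsmount.

@mnt_count: A counter. Before using a vfsmount, its reference count must be incremented with mntget; when it is no longer needed, mntput decrements the reference count.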
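The get/put discipline around mnt_count follows the usual reference counting pattern. The sketch below models it with C11 atomics in user space. It is not the kernel implementation of mntget and mntput: mount_ref, ref_get and ref_put are illustrative names, and the released flag merely marks the point at which a real implementation would free the object.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Model of a reference-counted mount object. */
struct mount_ref {
    atomic_int count;   /* plays the role of mnt_count */
    int released;       /* set when the last reference is dropped */
};

/* Take a reference before using the object; NULL is passed through. */
static struct mount_ref *ref_get(struct mount_ref *m)
{
    if (m)
        atomic_fetch_add(&m->count, 1);
    return m;
}

/* Drop a reference; the caller that drops the last one triggers release. */
static void ref_put(struct mount_ref *m)
{
    /* atomic_fetch_sub returns the value held before the decrement,
     * so 1 means the count has just reached zero. */
    if (m && atomic_fetch_sub(&m->count, 1) == 1)
        m->released = 1;
}
```

Pairing every ref_get with exactly one ref_put guarantees that the object stays alive for as long as any user holds it and is released exactly once.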