- Sharing a region of memory backed by a file or device is simply a case of calling `mmap()` with the `MAP_SHARED` flag.
- Regardless, there are 2 important cases where an anonymous region needs to be shared between processes:
1. When `mmap()` with `MAP_SHARED` is used without file backing - these regions will be shared between a parent and a child process after a `fork()` is executed.
2. When a region explicitly sets up an anonymous mapping via (userland calls) `shmget()` and attaches it to the virtual address space with `shmat()`.
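- To make case 1 concrete, here is a minimal userland sketch (not from the source) of a parent and child sharing an anonymous `MAP_SHARED` region across `fork()`; using `MAP_ANONYMOUS` rather than mapping `/dev/zero` is an illustrative choice:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        /* Anonymous, shared mapping - no file backing from userland's view. */
        char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        if (fork() == 0) {
                /* Child: write into the shared region. */
                strcpy(shared, "hello from the child");
                _exit(0);
        }

        wait(NULL);
        /* Parent: sees the child's write because the region is shared. */
        printf("%s\n", shared);
        return 0;
}
```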
- When pages within a VMA are backed by a file on disk, the interface used is fairly straightforward. To read a page during a page fault, the required `nopage()` function is found in the `struct vm_area_struct->vm_ops` field, and the required `writepage()` function is found in the `struct address_space_operations` using `inode->i_mapping->a_ops` or alternatively `page->mapping->a_ops`.
- When normal file operations are taking place such as `mmap()`, `read()` and `write()`, the `struct file_operations` with the appropriate functions is found using `inode->i_fop`. The diagram shown in 4.2 goes into more detail.
- The interface is nice and clean and is conceptually easy to understand, but it does not help anonymous pages as there is no file backing there.
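- As a rough illustration (a sketch only, not verbatim kernel code), the two lookups described above boil down to following these pointers; the `nopage()` and `writepage()` argument lists shown are assumptions based on the 2.4-era interfaces:

```c
/* Fault path: the per-VMA operations supply nopage() to read the page in. */
struct page *page = vma->vm_ops->nopage(vma, address, 0);

/* Writeback path: the owning address_space supplies writepage(). */
page->mapping->a_ops->writepage(page);
```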
- To keep this nice interface, Linux creates an artificial file backing for anonymous pages using a RAM-based filesystem in which each VMA is backed by a file.
- Every inode in the artificial filesystem is placed on a linked list, `shmem_inodes`, so they can be easily located.
- Overall this approach allows the same file-based interface to be used without treating anonymous pages as a special case.
- The filesystem comes in two variations in 2.4.22 - `shm` and `tmpfs`. They share core functionality and mainly differ in what they're used for:
- `shm` is used by the kernel for creating file backings for anonymous pages and for backing regions created by `shmget()`. The filesystem is mounted by `kern_mount()`, so it's mounted internally and isn't visible to users.
- `tmpfs` is a temporary filesystem that is often mounted at `/tmp` publicly as a fast RAM-based temporary filesystem. A secondary use is to mount tmpfs at `/dev/shm`, which allows `mmap()`-ing of files there in order to share information as a form of IPC. Regardless of the type of use, a sysadmin has to explicitly mount the tmpfs.
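- As an aside, explicitly mounting tmpfs from a program can be done with `mount(2)` as sketched below; the mount point and the `size=` option are illustrative assumptions, and in practice a sysadmin would typically just run `mount -t tmpfs tmpfs /tmp`:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Mount a 64MiB tmpfs instance at /tmp (requires root). */
        if (mount("tmpfs", "/tmp", "tmpfs", 0, "size=64m") != 0) {
                perror("mount");
                return 1;
        }
        return 0;
}
```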
- The virtual filesystem is initialised by `init_tmpfs()` either during system start or when the module is being loaded. This does the following:
1. Mounts `tmpfs` and `shm`, with `shm` mounted internally via `kern_mount()`.
2. Calculates the maximum number of blocks and inodes that can exist in the filesystem.
- As part of the registration, the function `shmem_read_super()` is used as a callback to populate a `struct super_block` with more information about the filesystems, such as making the block size equal to the page size.
- Every inode created in the filesystem will have a `struct shmem_inode_info` associated with it, containing private information specific to the filesystem.
- `SHMEM_I()` takes an inode as a parameter and returns a pointer to a `struct shmem_inode_info`, bridging the gap between the two (a sketch of how this can be done follows the field descriptions below.) Taking a look at the latter:
```c
struct shmem_inode_info {
        spinlock_t       lock;
        unsigned long    next_index;
        swp_entry_t      i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
        void           **i_indirect;                /* indirect blocks */
        unsigned long    swapped;                   /* data pages assigned to swap */
        unsigned long    flags;
        struct list_head list;
        struct inode    *inode;
};
```
- Considering each field:
- `lock` - Spinlock protecting the inode information from concurrent accesses.
- `next_index` - Index of the last page being used in the file. This will be different from `inode->i_size` when a file is being truncated.
- `i_direct` - Direct block containing the first `SHMEM_NR_DIRECT` swap vectors in use by the file (see 12.4.1.)
- `i_indirect` - Pointer to the first indirect block (see 12.4.1.)
- `swapped` - Count of the number of pages belonging to the file that are currently swapped out.
- `flags` - Currently only used to determine if the file belongs to a shared region set up by `shmget()`. It is set by specifying `SHM_LOCK` when calling `shmctl()` and unlocked via `SHM_UNLOCK`.
- `list` - List of all inodes used by the filesystem.
- `inode` - Pointer to the parent inode.
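- A minimal sketch of how `SHMEM_I()` can bridge the two structures in 2.4.22, assuming the shmem data lives in the inode's filesystem-private union (this definition is an assumption, not quoted from the source):

```c
/* Assumed: struct inode's union u holds the per-filesystem inode data. */
#define SHMEM_I(inode) (&(inode)->u.shmem_i)
```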
- Different structs contain pointers for `shmem`-specific functions. Regardless, in all cases `tmpfs` and `shm` share the same structs.
- For faulting in pages and writing them to backing storage, two structs are used - a `struct address_space_operations`, `shmem_aops`, and a `struct vm_operations_struct`, `shmem_vm_ops`.
- Taking a look at the address space operations struct, `shmem_aops`:

```c
static struct address_space_operations shmem_aops = {
        removepage:     shmem_removepage,
        writepage:      shmem_writepage,
#ifdef CONFIG_TMPFS
        readpage:       shmem_readpage,
        prepare_write:  shmem_prepare_write,
        commit_write:   shmem_commit_write,
#endif
};
```
- The most important function referenced here is `shmem_writepage()`, which is called when a page is moved from the page cache to the swap cache.
- `shmem_removepage()` is called when a page is removed from the page cache so that the block can be reclaimed.
- `shmem_readpage()` is not used by `tmpfs`, but is provided so the `sendfile()` system call can be used with tmpfs files.
- Similarly, `shmem_prepare_write()` and `shmem_commit_write()` are also unused, but provided so that `tmpfs` can be used with the loopback device.
- Taking a look at `shmem_vm_ops`:

```c
static struct vm_operations_struct shmem_vm_ops = {
        nopage:         shmem_nopage,
};
```

- `shmem_vm_ops` is used by anonymous VMAs as their `struct vm_operations_struct` so that `shmem_nopage()` is called when a new page is faulted in.
- To perform operations on files and inodes, a `struct file_operations` and a `struct inode_operations` are required. The former is provided as `shmem_file_operations`:

```c
static struct file_operations shmem_file_operations = {
        mmap:           shmem_mmap,
#ifdef CONFIG_TMPFS
        read:           shmem_file_read,
        write:          shmem_file_write,
        fsync:          shmem_sync_file,
#endif
};
```
- Three sets of `struct inode_operations` are provided - `shmem_inode_operations`, `shmem_dir_inode_operations`, and a related pair, `shmem_symlink_inline_operations` and `shmem_symlink_inode_operations` - for handling file inodes, directory inodes and symlink inodes respectively.
- The two operations supported for file inodes are `truncate()` and `setattr()`:
```c
static struct inode_operations shmem_inode_operations = {
        truncate:       shmem_truncate,
        setattr:        shmem_notify_change,
};
```
- `shmem_truncate()` is used to truncate a file. Since `shmem_notify_change()` is called when file attributes change, it's possible (amongst other things) for a file to be grown with `truncate()` and to use the global zero page as the data page.
- The directory `struct inode_operations` provides a number of functions:
```c
static struct inode_operations shmem_dir_inode_operations = {
#ifdef CONFIG_TMPFS
        create:         shmem_create,
        lookup:         shmem_lookup,
        link:           shmem_link,
        unlink:         shmem_unlink,
        symlink:        shmem_symlink,
        mkdir:          shmem_mkdir,
        rmdir:          shmem_rmdir,
        mknod:          shmem_mknod,
        rename:         shmem_rename,
#endif
};
```
- Finally, we have the pair of symlink `struct inode_operations`:

```c
static struct inode_operations shmem_symlink_inline_operations = {
        readlink:       shmem_readlink_inline,
        follow_link:    shmem_follow_link_inline,
};

static struct inode_operations shmem_symlink_inode_operations = {
        truncate:       shmem_truncate,
        readlink:       shmem_readlink,
        follow_link:    shmem_follow_link,
};
```
- The difference between the two sets of `readlink()` and `follow_link()` functions is related to where the information is stored. A symlink inode does not require the private inode information stored in a `struct shmem_inode_info`.
- If the length of the symbolic link name is smaller than a `struct shmem_inode_info`, the space in the inode is used to store the name and `shmem_symlink_inline_operations` becomes the relevant `struct inode_operations` (see the sketch below.)
- Otherwise, a page is allocated via `shmem_getpage()`, the symbolic link is copied to it and `shmem_symlink_inode_operations` is used.
- The latter struct contains a `truncate()` function so the page will be reclaimed when the file is deleted.
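- A conceptual sketch of that inline-versus-page decision (not the kernel's actual `shmem_symlink()`; the `shmem_getpage()` call and the surrounding error handling are simplified assumptions):

```c
int len = strlen(symname) + 1;

if (len <= sizeof(struct shmem_inode_info)) {
        /* Short name: store it inline in the space normally used for the
         * shmem-private inode information. */
        memcpy(SHMEM_I(inode), symname, len);
        inode->i_op = &shmem_symlink_inline_operations;
} else {
        /* Long name: back the link with a page from shmem_getpage() and use
         * the page-based symlink operations instead. */
        struct page *page;
        shmem_getpage(inode, 0, &page);         /* signature assumed */
        memcpy(kmap(page), symname, len);
        kunmap(page);
        inode->i_op = &shmem_symlink_inode_operations;
}
```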
- These structs ensure that the shmem equivalent of inode-related operations will be used when regions are backed by virtual files. The majority of the VM sees no difference between pages backed by a real file and those backed by virtual files.
- Because `tmpfs` is mounted as a proper filesystem that is visible to the user, it must support directory inode operations, as provided for by `shmem_dir_inode_operations`, discussed in the previous section.
- The implementations of most of the functions used by `shmem_dir_inode_operations` are rather small, but they are all very interconnected.
- The majority of the inode fields provided by the VFS are filled in by `shmem_get_inode()`.
- When creating a new file, `shmem_create()` is used - this is a small wrapper around `shmem_mknod()` which is called with the `S_IFREG` flag set so a regular file will be created.
- `shmem_mknod()` is in turn really just a wrapper around `shmem_get_inode()`, which creates an inode and fills in the `struct inode` fields as required.
- The 3 most interesting fields that are populated are the `inode->i_mapping->a_ops`, `inode->i_op` and `inode->i_fop` fields.
- After the inode has been created, `shmem_mknod()` updates the directory inode's `size` and `mtime` statistics before instantiating the new inode.
- Files are created differently in `shm` despite the fact the filesystems are basically the same - this is looked at in more detail in 12.7.
- When a page fault occurs, `do_no_page()` will invoke `struct vm_area_struct->vm_ops->nopage` if it exists. In the case of the virtual file filesystem, this means the function `shmem_nopage()` will be called.
- The core of `shmem_nopage()` is `shmem_getpage()`, which is responsible for either allocating a new page or finding it in swap.
- This overloading of fault types is unusual because `do_swap_page()` is usually responsible for locating pages that have been moved to the swap cache or backing storage, using information encoded within the relevant PTE.
- In the case of a virtual file, the relevant PTEs are set to 0 when they are moved to the swap cache. The inode's private filesystem data stores direct and indirect block information which is used to locate the pages later.
- Other than this, the overall operation is very similar to standard page faulting.
- When a page has been swapped out, a `swp_entry_t` will contain the information needed to locate the page again - instead of using PTEs to find it, we store this in the inode's filesystem-specific private information (LS - there doesn't appear to be a private data field for an inode, perhaps this is stored elsewhere?)
- When faulting, `shmem_swp_alloc()` is used to locate the swap entry (LS - couldn't be certain, book refers to `shmem_alloc_entry()` which doesn't exist, this seems most likely to be the correct function.)
- Its basic task is to do some sanity checks (ensuring `shmem_inode_info->next_index` always points to the page index at the end of the virtual file) before calling `shmem_swp_entry()` to search for the swap vector within the inode information, allocating a new one if it cannot be found.
- The first `SHMEM_NR_DIRECT` entries are stored in `struct shmem_inode_info->i_direct`. This means that for i386, files smaller than 64KiB (`SHMEM_NR_DIRECT * PAGE_SIZE`) will not need to use indirect blocks. Larger files must use indirect blocks, starting with the one pointed to by `i_indirect`.
- The initial indirect block (`i_indirect`) is broken into two halves - the first half containing pointers to 'doubly-indirect' blocks, and the second half containing pointers to 'triply-indirect' blocks.
- The doubly-indirect blocks are pages containing swap vectors (`swp_entry_t`), and the triply-indirect blocks contain pointers to pages, which in turn are filled with swap vectors. A sketch of the lookup arithmetic this implies follows the diagram below.
- This relationship means that the maximum number of pages in a virtual file (`SHMEM_MAX_INDEX`) is defined as follows:

```c
#define SHMEM_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
```
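- Plugging in rough i386 numbers gives a feel for the limit this imposes. The values below (`SHMEM_NR_DIRECT` = 16, `ENTRIES_PER_PAGE` = `PAGE_SIZE/sizeof(swp_entry_t)` = 1024) are assumptions, not quoted from the source:

```c
#include <stdio.h>

int main(void)
{
        /* Assumed i386 values - see the lead-in above. */
        unsigned long long entries_per_page     = 1024;
        unsigned long long entries_per_pagepage = entries_per_page * entries_per_page;
        unsigned long long shmem_nr_direct      = 16;

        unsigned long long max_index = shmem_nr_direct +
                (entries_per_pagepage / 2) * (entries_per_page + 1);

        /* Prints roughly 537 million pages, i.e. around 2TiB per virtual file. */
        printf("SHMEM_MAX_INDEX ~= %llu pages (~%llu GiB)\n",
               max_index, (max_index * 4096) / (1024ULL * 1024 * 1024));
        return 0;
}
```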
- Looking at how indirect blocks are traversed, diagrammatically:

```
double-
indirect block
pages
/--->--------------- /------>---------------
| | swp_entry_t | | | swp_entry_t |
| |-- - -- - -- | | |-- - -- - -- |
| | swp_entry_t | | | swp_entry_t |
| |-- - -- - -- | | |-- - -- - -- |
| | swp_entry_t | | | swp_entry_t |
--------------------- | |-- - -- - -- | | |-- - -- - -- |
| inode->i_indirect | | | swp_entry_t | | | swp_entry_t |
--------------------- | --------------- | ---------------
| | |
| | /->--------------- | /---->---------------
v indirect block page | | | swp_entry_t | | | | swp_entry_t |
---------------------- | | |-- - -- - -- | | | |-- - -- - -- |
^ | ---/ | | swp_entry_t | | | | swp_entry_t |
ENTRIES_PER_PAGE/2 | |-- - -- - -- - -- - | | |-- - -- - -- | | | |-- - -- - -- |
v | -----/ | swp_entry_t | | | | swp_entry_t |
|--------------------| |-- - -- - -- | | | |-- - -- - -- |
^ | -----\ | swp_entry_t | | | | swp_entry_t |
ENTRIES_PER_PAGE/2 | |-- - -- - -- - -- - | | --------------- | | ---------------
v | ---\ | | |
---------------------- | | triple- | | /->---------------
| | indirect block | | | | swp_entry_t |
| | pages | | | |-- - -- - -- |
| \->--------------- | | | | swp_entry_t |
| | ---/ | | |-- - -- - -- |
| |-- - -- - -- | | | | swp_entry_t |
| | -----/ | |-- - -- - -- |
| |-- - -- - -- | | | swp_entry_t |
| | --------/ ---------------
| |-- - -- - -- |
| | ---------->---------------
| --------------- | swp_entry_t |
| |-- - -- - -- |
\--->--------------- | swp_entry_t |
| ---> |-- - -- - -- |
|-- - -- - -- | | swp_entry_t |
| ---> |-- - -- - -- |
|-- - -- - -- | | swp_entry_t |
| ---> ---------------
|-- - -- - -- |
| --->
---------------
```
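- The traversal shown above can be modelled with some simple arithmetic. The following userland sketch (assuming `SHMEM_NR_DIRECT` = 16 and `ENTRIES_PER_PAGE` = 1024 as above; the kernel's actual `shmem_swp_entry()` differs in detail) classifies which block a given page index of a virtual file lands in:

```c
#include <stdio.h>

#define SHMEM_NR_DIRECT      16UL
#define ENTRIES_PER_PAGE     1024UL
#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE * ENTRIES_PER_PAGE)

static void classify(unsigned long index)
{
        unsigned long orig = index;

        if (index < SHMEM_NR_DIRECT) {
                printf("page %9lu -> i_direct[%lu]\n", orig, index);
                return;
        }

        index -= SHMEM_NR_DIRECT;
        if (index < ENTRIES_PER_PAGEPAGE / 2) {
                /* First half of i_indirect: pointers to pages of swap vectors. */
                printf("page %9lu -> doubly-indirect block %lu, entry %lu\n",
                       orig, index / ENTRIES_PER_PAGE, index % ENTRIES_PER_PAGE);
                return;
        }

        /* Second half of i_indirect: pointers to pages of pointers to pages
         * of swap vectors. */
        index -= ENTRIES_PER_PAGEPAGE / 2;
        printf("page %9lu -> triply-indirect block %lu, page %lu, entry %lu\n",
               orig, index / ENTRIES_PER_PAGEPAGE,
               (index % ENTRIES_PER_PAGEPAGE) / ENTRIES_PER_PAGE,
               index % ENTRIES_PER_PAGE);
}

int main(void)
{
        classify(3);        /* lands in the direct block */
        classify(100);      /* doubly-indirect           */
        classify(600000);   /* triply-indirect           */
        return 0;
}
```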
- The function `shmem_writepage()` is the function registered in the filesystem's `struct address_space_operations` for writing pages to swap. It's responsible for simply moving the page from the page cache to the swap cache, implemented as follows:
1. Record the current `struct page->mapping` and information about the inode.
2. Allocate a free slot in the backing storage with `get_swap_page()`.
3. Allocate a `swp_entry_t` via `shmem_swp_entry()`.
4. Remove the page from the page cache.
5. Add the page to the swap cache. If this fails, free the swap slot, add the page back to the page cache, and try again.
- Four operations are supported with virtual files - `mmap()`, `read()`, `write()` and `fsync()` - as stored in `shmem_file_operations` (discussed in 12.2.)
- There's nothing too unusual here (a userland example exercising all four follows these bullets):
- `mmap()` is implemented by `shmem_mmap()`, and it simply updates the VMA that is managing the mapped region.
- `read()` is implemented by `shmem_file_read()`, which copies bytes from the virtual file to a userspace buffer, faulting in pages as necessary.
- `write()` is implemented by `shmem_file_write()` and is essentially the same deal as `read()`.
- `fsync()` is implemented by `shmem_sync_file()` but is essentially a NULL operation that simply returns 0 for success - because the files only exist in RAM, they don't need to be synchronised with any disk.
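- A minimal userland sketch (not from the source) exercising the four operations against a file on a tmpfs mount; `/dev/shm` is assumed to be such a mount, and the comments map each call to the kernel-side function named above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/shm/example", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        write(fd, "hello", 6);                  /* shmem_file_write()          */
        fsync(fd);                              /* shmem_sync_file(), a no-op  */

        char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED)                  /* shmem_mmap()                */
                printf("via mmap: %s\n", map);

        char buf[16] = "";
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, sizeof(buf) - 1);         /* shmem_file_read()           */
        printf("via read: %s\n", buf);

        unlink("/dev/shm/example");
        close(fd);
        return 0;
}
```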
- The most complex operation that is supported for inodes is truncation, which involves four distinct stages:
1. `shmem_truncate()` - Truncates a partial page at the end of the file and continually calls `shmem_truncate_indirect()` until the file is truncated to the proper size. Each call to `shmem_truncate_indirect()` will only process one indirect block per pass, which is why it may need to be called multiple times.
2. `shmem_truncate_indirect()` - Deals with both doubly and triply-indirect blocks. It finds the next indirect block that needs to be truncated to pass to stage 3; this block will contain pointers to pages which will in turn contain swap vectors.
3. `shmem_truncate_direct()` - Selects a range of swap vectors from a page that need to be truncated and passes that range to `shmem_free_swp()`.
4. `shmem_free_swp()` - Frees entries via `free_swap_and_cache()`, which frees both the swap entry and the page containing data.
- The linking and unlinking of files is very simple because most of the work is performed by the filesystem layer.
- To link a file, the directory inode size is incremented, the `ctime` and `mtime` of the affected inodes are updated, and the number of links to the inode being linked is incremented. A reference to the new `struct dentry` is then taken via `dget()` followed by `d_instantiate()`.
- To unlink a file, the same inode statistics are updated before the reference to the `struct dentry` is decremented with `dput()` (and subsequently `iput()`), which will clear up the inode when its reference count hits zero.
- Creating a directory will use `shmem_mkdir()`, which in turn uses `shmem_mknod()` with the `S_IFDIR` flag set, before incrementing the parent directory `struct inode`'s `i_nlink` counter.
- The function `shmem_rmdir()` will delete a directory, first making sure it's empty via `shmem_empty()`. If it is, the function then decrements the parent directory `struct inode->i_nlink` count and calls `shmem_unlink()` to remove it.
- A shared region is backed by a file created in `shm`.
- There are two cases where a new file will be created - during the setup of a shared region via (userland) `shmget()`, or when an anonymous region is set up via (userland) `mmap()` with the `MAP_SHARED` flag specified.
- Both approaches ultimately use `shmem_file_setup()` to create a file, which simply creates a new `struct dentry` and `struct inode`, fills in the relevant details and instantiates them.
- Because the filesystem is internal, the names of the files created do not have to be unique, as files are always located by inode, not name. As a result, `shmem_zero_setup()` always creates a file named `"dev/zero"`, which is how it shows up in `/proc/<pid>/maps` (see the sketch below.)
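- A sketch of roughly what `shmem_zero_setup()` does, reconstructed from the description above (the field names and exact sequence are assumptions based on the 2.4.22 sources, not verbatim):

```c
int shmem_zero_setup(struct vm_area_struct *vma)
{
        struct file *file;
        loff_t size = vma->vm_end - vma->vm_start;

        /* The name is purely cosmetic - the file is only ever found by inode. */
        file = shmem_file_setup("dev/zero", size);
        if (IS_ERR(file))
                return PTR_ERR(file);

        if (vma->vm_file)
                fput(vma->vm_file);
        vma->vm_file = file;
        vma->vm_ops = &shmem_vm_ops;    /* so shmem_nopage() handles faults */
        return 0;
}
```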
- Files created by `shmget()` are named `SYSV<NN>`, where `<NN>` is the key that is passed as a parameter to `shmget()`.
- To avoid spending too long on this, we focus only on `shmget()` and `shmat()` and how they are affected by the VM.
- `shmget()` is implemented via `sys_shmget()`. It performs sanity checks, sets up the IPC-related data structures and sets up a new segment via `newseg()`, which creates the file in `shmfs` with `shmem_file_setup()` as previously discussed.
- `shmat()` is implemented via `sys_shmat()`. It does sanity checks, acquires the appropriate descriptor and hands over to `do_mmap()` to map the shared region into the process address space.
- `sys_shmat()` is responsible for ensuring that VMAs don't overlap if the caller erroneously specifies an address that would cause them to.
- The `struct shmid_kernel->shm_nattch` counter is maintained by `shm_vm_ops`, a `struct vm_operations_struct` that registers `open()` and `close()` callbacks of `shm_open()` and `shm_close()` respectively.
- `shm_close()` is also responsible for destroying shared regions if the `SHM_DEST` flag is specified and the `shm_nattch` counter reaches zero.
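- To tie the userland and kernel sides together, here is a minimal sketch (not from the source) of the `shmget()`/`shmat()` path described above; the key, size and permissions are arbitrary illustrative values:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
        /* sys_shmget() -> newseg() backs this segment with a file in shm. */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (shmid < 0) {
                perror("shmget");
                return 1;
        }

        /* sys_shmat() -> do_mmap() maps the segment into our address space. */
        char *addr = shmat(shmid, NULL, 0);
        if (addr == (char *)-1) {
                perror("shmat");
                return 1;
        }

        strcpy(addr, "shared via System V shm");
        printf("%s\n", addr);

        shmdt(addr);                    /* detaching drops shm_nattch        */
        shmctl(shmid, IPC_RMID, NULL);  /* mark the segment for destruction  */
        return 0;
}
```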