- Sharing a region of memory backed by a file or device is simply a case of calling `mmap()` with the `MAP_SHARED` flag.
- Regardless, there are 2 important cases where an anonymous region needs to be shared between processes:
1. When `mmap()` with `MAP_SHARED` is used without file backing - these regions will be shared between a parent and a child process after a `fork()` is executed.
2. When a region explicitly sets up an anonymous mapping via (userland calls) `shmget()` and attaches it to the virtual address space with `shmat()`.
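- To make case 1 concrete, here is a minimal userland sketch (not from the source) of a parent and child sharing an anonymous `MAP_SHARED` region across `fork()`; using `MAP_ANONYMOUS` rather than mapping `/dev/zero` is an illustrative choice:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        /* Anonymous, shared mapping - no file backing from userland's view. */
        char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        if (fork() == 0) {
                /* Child: write into the shared region. */
                strcpy(shared, "hello from the child");
                _exit(0);
        }

        wait(NULL);
        /* Parent: sees the child's write because the region is shared. */
        printf("%s\n", shared);
        return 0;
}
```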
- When pages within a VMA are backed by a file on disk, the interface used is fairly straightforward. To read a page during a page fault, the required `nopage()` function is found in the `struct vm_area_struct->vm_ops` field, and the required `writepage()` function is found in the `struct address_space_operations` using `inode->i_mapping->a_ops` or alternatively `page->mapping->a_ops`.
- When normal file operations are taking place such as `mmap()`, `read()` and `write()`, the `struct file_operations` with the appropriate functions is found using `inode->i_fop`. The diagram shown in 4.2 goes into more detail.
- The interface is nice and clean and is conceptually easy to understand, but it does not help anonymous pages as there is no file backing there.
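- As a rough illustration (a sketch only, not verbatim kernel code), the two lookups described above boil down to following these pointers; the `nopage()` and `writepage()` argument lists shown are assumptions based on the 2.4-era interfaces:

```c
/* Fault path: the per-VMA operations supply nopage() to read the page in. */
struct page *page = vma->vm_ops->nopage(vma, address, 0);

/* Writeback path: the owning address_space supplies writepage(). */
page->mapping->a_ops->writepage(page);
```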
- To keep this nice interface, Linux creates an artificial file backing for anonymous pages using a RAM-based filesystem in which each VMA is backed by a file.
- Every inode in the artificial filesystem is placed on a linked list, `shmem_inodes`, so they can be easily located.
- Overall this approach allows the same file-based interface to be used without treating anonymous pages as a special case.
- The filesystem comes in two variations in 2.4.22 - `shm` and `tmpfs`. They share core functionality and mainly differ in what they're used for:
- `shm` is used by the kernel for creating file backings for anonymous pages and for backing regions created by `shmget()`. The filesystem is mounted by `kern_mount()`, so it's mounted internally and isn't visible to users.
- `tmpfs` is a temporary filesystem that is often mounted at `/tmp` publicly as a fast RAM-based temporary filesystem. A secondary use is to mount tmpfs at `/dev/shm`, which allows `mmap()`-ing of files there in order to share information as a form of IPC. Regardless of the type of use, a sysadmin has to explicitly mount the tmpfs.
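- As an aside, explicitly mounting tmpfs from a program can be done with `mount(2)` as sketched below; the mount point and the `size=` option are illustrative assumptions, and in practice a sysadmin would typically just run `mount -t tmpfs tmpfs /tmp`:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Mount a 64MiB tmpfs instance at /tmp (requires root). */
        if (mount("tmpfs", "/tmp", "tmpfs", 0, "size=64m") != 0) {
                perror("mount");
                return 1;
        }
        return 0;
}
```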
- The virtual filesystem is initialised by `init_tmpfs()` either during system start or when the module is being loaded. This does the following:
1. Mounts `tmpfs` and `shm`, with `shm` mounted internally via `kern_mount()`.
2. Calculates the maximum number of blocks and inodes that can exist in the filesystem.
- As part of the registration, the function `shmem_read_super()` is used as a callback to populate a `struct super_block` with more information about the filesystems, such as making the block size equal to the page size.
- Every inode created in the filesystem will have a `struct shmem_inode_info` associated with it, containing private information specific to the filesystem.
- `SHMEM_I()` takes an inode as a parameter and returns a pointer to a `struct shmem_inode_info`, bridging the gap between the two (a sketch of how this can be done follows the field descriptions below.) Taking a look at the latter:
```c
struct shmem_inode_info {
        spinlock_t       lock;
        unsigned long    next_index;
        swp_entry_t      i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
        void           **i_indirect;                /* indirect blocks */
        unsigned long    swapped;                   /* data pages assigned to swap */
        unsigned long    flags;
        struct list_head list;
        struct inode    *inode;
};
```
- Considering each field:
- `lock` - Spinlock protecting the inode information from concurrent accesses.
- `next_index` - Index of the last page being used in the file. This will be different from `inode->i_size` when a file is being truncated.
- `i_direct` - Direct block containing the first `SHMEM_NR_DIRECT` swap vectors in use by the file (see 12.4.1.)
- `i_indirect` - Pointer to the first indirect block (see 12.4.1.)
- `swapped` - Count of the number of pages belonging to the file that are currently swapped out.
- `flags` - Currently only used to determine if the file belongs to a shared region set up by `shmget()`. It is set by specifying `SHM_LOCK` when calling `shmctl()` and unlocked via `SHM_UNLOCK`.
- `list` - List of all inodes used by the filesystem.
- `inode` - Pointer to the parent inode.
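- A minimal sketch of how `SHMEM_I()` can bridge the two structures in 2.4.22, assuming the shmem data lives in the inode's filesystem-private union (this definition is an assumption, not quoted from the source):

```c
/* Assumed: struct inode's union u holds the per-filesystem inode data. */
#define SHMEM_I(inode) (&(inode)->u.shmem_i)
```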
- Different structs contain pointers for `shmem`-specific functions. Regardless, in all cases `tmpfs` and `shm` share the same structs.
- For faulting in pages and writing them to backing storage, two structs are used - a `struct address_space_operations`, `shmem_aops`, and a `struct vm_operations_struct`, `shmem_vm_ops`.
- Taking a look at the address space operations struct, `shmem_aops`:

```c
static struct address_space_operations shmem_aops = {
        removepage:     shmem_removepage,
        writepage:      shmem_writepage,
#ifdef CONFIG_TMPFS
        readpage:       shmem_readpage,
        prepare_write:  shmem_prepare_write,
        commit_write:   shmem_commit_write,
#endif
};
```
- The most important function referenced here is `shmem_writepage()`, which is called when a page is moved from the page cache to the swap cache.
- `shmem_removepage()` is called when a page is removed from the page cache so that the block can be reclaimed.
- `shmem_readpage()` is not used by `tmpfs`, but is provided so the `sendfile()` system call can be used with tmpfs files.
- Similarly, `shmem_prepare_write()` and `shmem_commit_write()` are also unused, but provided so that `tmpfs` can be used with the loopback device.
- Taking a look at `shmem_vm_ops`:

```c
static struct vm_operations_struct shmem_vm_ops = {
        nopage:         shmem_nopage,
};
```

- `shmem_vm_ops` is used by anonymous VMAs as their `struct vm_operations_struct` so that `shmem_nopage()` is called when a new page is faulted in.
- To perform operations on files and inodes, a `struct file_operations` and a `struct inode_operations` are required. The former is provided as `shmem_file_operations`:

```c
static struct file_operations shmem_file_operations = {
        mmap:           shmem_mmap,
#ifdef CONFIG_TMPFS
        read:           shmem_file_read,
        write:          shmem_file_write,
        fsync:          shmem_sync_file,
#endif
};
```
- Three sets of `struct inode_operations` are provided - `shmem_inode_operations`, `shmem_dir_inode_operations`, and a related pair, `shmem_symlink_inline_operations` and `shmem_symlink_inode_operations` - for handling file inodes, directory inodes and symlink inodes respectively.
- The two operations supported for file inodes are `truncate()` and `setattr()`:
```c
static struct inode_operations shmem_inode_operations = {
        truncate:       shmem_truncate,
        setattr:        shmem_notify_change,
};
```
- `shmem_truncate()` is used to truncate a file. Since `shmem_notify_change()` is called when file attributes change, it's possible (amongst other things) for a file to be grown with `truncate()` and to use the global zero page as the data page.
- The directory `struct inode_operations` provides a number of functions:
```c
static struct inode_operations shmem_dir_inode_operations = {
#ifdef CONFIG_TMPFS
        create:         shmem_create,
        lookup:         shmem_lookup,
        link:           shmem_link,
        unlink:         shmem_unlink,
        symlink:        shmem_symlink,
        mkdir:          shmem_mkdir,
        rmdir:          shmem_rmdir,
        mknod:          shmem_mknod,
        rename:         shmem_rename,
#endif
};
```
- Finally, we have the pair of symlink `struct inode_operations`:

```c
static struct inode_operations shmem_symlink_inline_operations = {
        readlink:       shmem_readlink_inline,
        follow_link:    shmem_follow_link_inline,
};

static struct inode_operations shmem_symlink_inode_operations = {
        truncate:       shmem_truncate,
        readlink:       shmem_readlink,
        follow_link:    shmem_follow_link,
};
```
- The difference between the two sets of `readlink()` and `follow_link()` functions is related to where the information is stored. A symlink inode does not require the private inode information stored in a `struct shmem_inode_info`.
- If the length of the symbolic link name is smaller than a `struct shmem_inode_info`, the space in the inode is used to store the name and `shmem_symlink_inline_operations` becomes the relevant `struct inode_operations` (see the sketch below.)
- Otherwise, a page is allocated via `shmem_getpage()`, the symbolic link is copied to it and `shmem_symlink_inode_operations` is used.
- The latter struct contains a `truncate()` function so the page will be reclaimed when the file is deleted.
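- A conceptual sketch of that inline-versus-page decision (not the kernel's actual `shmem_symlink()`; the `shmem_getpage()` call and the surrounding error handling are simplified assumptions):

```c
int len = strlen(symname) + 1;

if (len <= sizeof(struct shmem_inode_info)) {
        /* Short name: store it inline in the space normally used for the
         * shmem-private inode information. */
        memcpy(SHMEM_I(inode), symname, len);
        inode->i_op = &shmem_symlink_inline_operations;
} else {
        /* Long name: back the link with a page from shmem_getpage() and use
         * the page-based symlink operations instead. */
        struct page *page;
        shmem_getpage(inode, 0, &page);         /* signature assumed */
        memcpy(kmap(page), symname, len);
        kunmap(page);
        inode->i_op = &shmem_symlink_inode_operations;
}
```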
- These structs ensure that the shmem equivalent of inode-related operations will be used when regions are backed by virtual files. The majority of the VM sees no difference between pages backed by a real file and those backed by virtual files.
- Because `tmpfs` is mounted as a proper filesystem that is visible to the user, it must support directory inode operations, as provided for by `shmem_dir_inode_operations`, discussed in the previous section.
- The implementations of most of the functions used by `shmem_dir_inode_operations` are rather small, but they are all very interconnected.
- The majority of the inode fields provided by the VFS are filled in by `shmem_get_inode()`.
- When creating a new file, `shmem_create()` is used - this is a small wrapper around `shmem_mknod()` which is called with the `S_IFREG` flag set so a regular file will be created.
- `shmem_mknod()` is in turn really just a wrapper around `shmem_get_inode()`, which creates an inode and fills in the `struct inode` fields as required.
- The 3 most interesting fields that are populated are the `inode->i_mapping->a_ops`, `inode->i_op` and `inode->i_fop` fields.
- After the inode has been created, `shmem_mknod()` updates the directory inode's `size` and `mtime` statistics before instantiating the new inode.
- Files are created differently in `shm` despite the fact the filesystems are basically the same - this is looked at in more detail in 12.7.
- When a page fault occurs, `do_no_page()` will invoke `struct vm_area_struct->vm_ops->nopage` if it exists. In the case of the virtual file filesystem, this means the function `shmem_nopage()` will be called.
- The core of `shmem_nopage()` is `shmem_getpage()`, which is responsible for either allocating a new page or finding it in swap.
- This overloading of fault types is unusual because `do_swap_page()` is usually responsible for locating pages that have been moved to the swap cache or backing storage, using information encoded within the relevant PTE.
- In the case of a virtual file, the relevant PTEs are set to 0 when they are moved to the swap cache. The inode's private filesystem data stores direct and indirect block information which is used to locate the pages later.
- Other than this, the overall operation is very similar to standard page faulting.
- When a page has been swapped out, a `swp_entry_t` will contain the information needed to locate the page again - instead of using PTEs to find it, we store this in the inode's filesystem-specific private information (LS - there doesn't appear to be a private data field for an inode, perhaps this is stored elsewhere?)
- When faulting, `shmem_swp_alloc()` is used to locate the swap entry (LS - couldn't be certain, book refers to `shmem_alloc_entry()` which doesn't exist, this seems most likely to be the correct function.)
- Its basic task is to do some sanity checks (ensuring `shmem_inode_info->next_index` always points to the page index at the end of the virtual file) before calling `shmem_swp_entry()` to search for the swap vector within the inode information, allocating a new one if it cannot be found.
- The first `SHMEM_NR_DIRECT` entries are stored in `struct shmem_inode_info->i_direct`. This means that for i386, files smaller than 64KiB (`SHMEM_NR_DIRECT * PAGE_SIZE`) will not need to use indirect blocks. Larger files must use indirect blocks, starting with the one pointed to by `i_indirect`.
- The initial indirect block (`i_indirect`) is broken into two halves - the first half containing pointers to 'doubly-indirect' blocks, and the second half containing pointers to 'triply-indirect' blocks.
- The doubly-indirect blocks are pages containing swap vectors (`swp_entry_t`), and the triply-indirect blocks contain pointers to pages, which in turn are filled with swap vectors. A sketch of the lookup arithmetic this implies follows the diagram below.
- This relationship means that the maximum number of pages in a virtual file (`SHMEM_MAX_INDEX`) is defined as follows:

```c
#define SHMEM_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
```
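- Plugging in rough i386 numbers gives a feel for the limit this imposes. The values below (`SHMEM_NR_DIRECT` = 16, `ENTRIES_PER_PAGE` = `PAGE_SIZE/sizeof(swp_entry_t)` = 1024) are assumptions, not quoted from the source:

```c
#include <stdio.h>

int main(void)
{
        /* Assumed i386 values - see the lead-in above. */
        unsigned long long entries_per_page     = 1024;
        unsigned long long entries_per_pagepage = entries_per_page * entries_per_page;
        unsigned long long shmem_nr_direct      = 16;

        unsigned long long max_index = shmem_nr_direct +
                (entries_per_pagepage / 2) * (entries_per_page + 1);

        /* Prints roughly 537 million pages, i.e. around 2TiB per virtual file. */
        printf("SHMEM_MAX_INDEX ~= %llu pages (~%llu GiB)\n",
               max_index, (max_index * 4096) / (1024ULL * 1024 * 1024));
        return 0;
}
```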
- Looking at how indirect blocks are traversed, diagrammatically:

```
double-
indirect block
pages
/--->--------------- /------>---------------
| | swp_entry_t | | | swp_entry_t |
| |-- - -- - -- | | |-- - -- - -- |
| | swp_entry_t | | | swp_entry_t |
| |-- - -- - -- | | |-- - -- - -- |
| | swp_entry_t | | | swp_entry_t |
--------------------- | |-- - -- - -- | | |-- - -- - -- |
| inode->i_indirect | | | swp_entry_t | | | swp_entry_t |
--------------------- | --------------- | ---------------
| | |
| | /->--------------- | /---->---------------
v indirect block page | | | swp_entry_t | | | | swp_entry_t |
---------------------- | | |-- - -- - -- | | | |-- - -- - -- |
^ | ---/ | | swp_entry_t | | | | swp_entry_t |
ENTRIES_PER_PAGE/2 | |-- - -- - -- - -- - | | |-- - -- - -- | | | |-- - -- - -- |
v | -----/ | swp_entry_t | | | | swp_entry_t |
|--------------------| |-- - -- - -- | | | |-- - -- - -- |
^ | -----\ | swp_entry_t | | | | swp_entry_t |
ENTRIES_PER_PAGE/2 | |-- - -- - -- - -- - | | --------------- | | ---------------
v | ---\ | | |
---------------------- | | triple- | | /->---------------
| | indirect block | | | | swp_entry_t |
| | pages | | | |-- - -- - -- |
| \->--------------- | | | | swp_entry_t |
| | ---/ | | |-- - -- - -- |
| |-- - -- - -- | | | | swp_entry_t |
| | -----/ | |-- - -- - -- |
| |-- - -- - -- | | | swp_entry_t |
| | --------/ ---------------
| |-- - -- - -- |
| | ---------->---------------
| --------------- | swp_entry_t |
| |-- - -- - -- |
\--->--------------- | swp_entry_t |
| ---> |-- - -- - -- |
|-- - -- - -- | | swp_entry_t |
| ---> |-- - -- - -- |
|-- - -- - -- | | swp_entry_t |
| ---> ---------------
|-- - -- - -- |
| --->
---------------
```
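- The traversal shown above can be modelled with some simple arithmetic. The following userland sketch (assuming `SHMEM_NR_DIRECT` = 16 and `ENTRIES_PER_PAGE` = 1024 as above; the kernel's actual `shmem_swp_entry()` differs in detail) classifies which block a given page index of a virtual file lands in:

```c
#include <stdio.h>

#define SHMEM_NR_DIRECT      16UL
#define ENTRIES_PER_PAGE     1024UL
#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE * ENTRIES_PER_PAGE)

static void classify(unsigned long index)
{
        unsigned long orig = index;

        if (index < SHMEM_NR_DIRECT) {
                printf("page %9lu -> i_direct[%lu]\n", orig, index);
                return;
        }

        index -= SHMEM_NR_DIRECT;
        if (index < ENTRIES_PER_PAGEPAGE / 2) {
                /* First half of i_indirect: pointers to pages of swap vectors. */
                printf("page %9lu -> doubly-indirect block %lu, entry %lu\n",
                       orig, index / ENTRIES_PER_PAGE, index % ENTRIES_PER_PAGE);
                return;
        }

        /* Second half of i_indirect: pointers to pages of pointers to pages
         * of swap vectors. */
        index -= ENTRIES_PER_PAGEPAGE / 2;
        printf("page %9lu -> triply-indirect block %lu, page %lu, entry %lu\n",
               orig, index / ENTRIES_PER_PAGEPAGE,
               (index % ENTRIES_PER_PAGEPAGE) / ENTRIES_PER_PAGE,
               index % ENTRIES_PER_PAGE);
}

int main(void)
{
        classify(3);        /* lands in the direct block */
        classify(100);      /* doubly-indirect           */
        classify(600000);   /* triply-indirect           */
        return 0;
}
```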
- The function `shmem_writepage()` is the function registered in the filesystem's `struct address_space_operations` for writing pages to swap. It's responsible for simply moving the page from the page cache to the swap cache, implemented as follows:
1. Record the current `struct page->mapping` and information about the inode.
2. Allocate a free slot in the backing storage with `get_swap_page()`.
3. Allocate a `swp_entry_t` via `shmem_swp_entry()`.
4. Remove the page from the page cache.
5. Add the page to the swap cache. If this fails, free the swap slot, add the page back to the page cache, and try again.
- Four operations are supported with virtual files - `mmap()`, `read()`, `write()` and `fsync()` - as stored in `shmem_file_operations` (discussed in 12.2.)
- There's nothing too unusual here (a userland example exercising all four follows these bullets):
- `mmap()` is implemented by `shmem_mmap()`, and it simply updates the VMA that is managing the mapped region.
- `read()` is implemented by `shmem_file_read()`, which copies bytes from the virtual file to a userspace buffer, faulting in pages as necessary.
- `write()` is implemented by `shmem_file_write()` and is essentially the same deal as `read()`.
- `fsync()` is implemented by `shmem_sync_file()` but is essentially a NULL operation that simply returns 0 for success - because the files only exist in RAM, they don't need to be synchronised with any disk.
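- A minimal userland sketch (not from the source) exercising the four operations against a file on a tmpfs mount; `/dev/shm` is assumed to be such a mount, and the comments map each call to the kernel-side function named above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/shm/example", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        write(fd, "hello", 6);                  /* shmem_file_write()          */
        fsync(fd);                              /* shmem_sync_file(), a no-op  */

        char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED)                  /* shmem_mmap()                */
                printf("via mmap: %s\n", map);

        char buf[16] = "";
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, sizeof(buf) - 1);         /* shmem_file_read()           */
        printf("via read: %s\n", buf);

        unlink("/dev/shm/example");
        close(fd);
        return 0;
}
```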
- The most complex operation that is supported for inodes is truncation, which involves four distinct stages:
1. `shmem_truncate()` - Truncates a partial page at the end of the file and continually calls `shmem_truncate_indirect()` until the file is truncated to the proper size. Each call to `shmem_truncate_indirect()` will only process one indirect block per pass, which is why it may need to be called multiple times.
2. `shmem_truncate_indirect()` - Deals with both doubly and triply-indirect blocks. It finds the next indirect block that needs to be truncated to pass to stage 3; this block will contain pointers to pages which will in turn contain swap vectors.
3. `shmem_truncate_direct()` - Selects a range of swap vectors from a page that need to be truncated and passes that range to `shmem_free_swp()`.
4. `shmem_free_swp()` - Frees entries via `free_swap_and_cache()`, which frees both the swap entry and the page containing data.
- The linking and unlinking of files is very simple because most of the work is performed by the filesystem layer.
- To link a file, the directory inode size is incremented, the `ctime` and `mtime` of the affected inodes are updated, and the number of links to the inode being linked is incremented. A reference to the new `struct dentry` is then taken via `dget()` followed by `d_instantiate()`.
- To unlink a file, the same inode statistics are updated before the reference to the `struct dentry` is decremented with `dput()` (and subsequently `iput()`), which will clear up the inode when its reference count hits zero.
- Creating a directory will use `shmem_mkdir()`, which in turn uses `shmem_mknod()` with the `S_IFDIR` flag set, before incrementing the parent directory `struct inode`'s `i_nlink` counter.
- The function `shmem_rmdir()` will delete a directory, first making sure it's empty via `shmem_empty()`. If it is, the function then decrements the parent directory `struct inode->i_nlink` count and calls `shmem_unlink()` to remove it.
- A shared region is backed by a file created in `shm`.
- There are two cases where a new file will be created - during the setup of a shared region via (userland) `shmget()`, or when an anonymous region is set up via (userland) `mmap()` with the `MAP_SHARED` flag specified.
- Both approaches ultimately use `shmem_file_setup()` to create a file, which simply creates a new `struct dentry` and `struct inode`, fills in the relevant details and instantiates them.
- Because the filesystem is internal, the names of the files created do not have to be unique, as files are always located by inode, not name. As a result, `shmem_zero_setup()` always creates a file named `"dev/zero"`, which is how it shows up in `/proc/<pid>/maps` (see the sketch below.)
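- A sketch of roughly what `shmem_zero_setup()` does, reconstructed from the description above (the field names and exact sequence are assumptions based on the 2.4.22 sources, not verbatim):

```c
int shmem_zero_setup(struct vm_area_struct *vma)
{
        struct file *file;
        loff_t size = vma->vm_end - vma->vm_start;

        /* The name is purely cosmetic - the file is only ever found by inode. */
        file = shmem_file_setup("dev/zero", size);
        if (IS_ERR(file))
                return PTR_ERR(file);

        if (vma->vm_file)
                fput(vma->vm_file);
        vma->vm_file = file;
        vma->vm_ops = &shmem_vm_ops;    /* so shmem_nopage() handles faults */
        return 0;
}
```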
- Files created by `shmget()` are named `SYSV<NN>`, where `<NN>` is the key that is passed as a parameter to `shmget()`.
- To avoid spending too long on this, we focus only on `shmget()` and `shmat()` and how they are affected by the VM.
- `shmget()` is implemented via `sys_shmget()`. It performs sanity checks, sets up the IPC-related data structures and sets up a new segment via `newseg()`, which creates the file in `shmfs` with `shmem_file_setup()` as previously discussed.
- `shmat()` is implemented via `sys_shmat()`. It does sanity checks, acquires the appropriate descriptor and hands over to `do_mmap()` to map the shared region into the process address space.
- `sys_shmat()` is responsible for ensuring that VMAs don't overlap if the caller erroneously specifies an address that would cause them to.
- The `struct shmid_kernel->shm_nattch` counter is maintained by `shm_vm_ops`, a `struct vm_operations_struct` that registers `open()` and `close()` callbacks of `shm_open()` and `shm_close()` respectively.
- `shm_close()` is also responsible for destroying shared regions if the `SHM_DEST` flag is specified and the `shm_nattch` counter reaches zero.
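- To tie the userland and kernel sides together, here is a minimal sketch (not from the source) of the `shmget()`/`shmat()` path described above; the key, size and permissions are arbitrary illustrative values:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
        /* sys_shmget() -> newseg() backs this segment with a file in shm. */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (shmid < 0) {
                perror("shmget");
                return 1;
        }

        /* sys_shmat() -> do_mmap() maps the segment into our address space. */
        char *addr = shmat(shmid, NULL, 0);
        if (addr == (char *)-1) {
                perror("shmat");
                return 1;
        }

        strcpy(addr, "shared via System V shm");
        printf("%s\n", addr);

        shmdt(addr);                    /* detaching drops shm_nattch        */
        shmctl(shmid, IPC_RMID, NULL);  /* mark the segment for destruction  */
        return 0;
}
```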