sync() behavior in LTFS

Modern operating systems typically maintain an in-kernel cache of file-system metadata and small buffers of recent writes to open files. The file-system metadata includes data such as the filename, timestamps, and permissions for recently accessed files and directories. This metadata cache and the write buffers are periodically flushed to the storage media automatically by the system. This flush operation is commonly referred to as the sync() operation, named after the user-space system call that exposes the functionality in UN*X-based systems such as Linux and Mac OS X. In Microsoft Windows this functionality is exposed by the FlushFileBuffers() call. sync() is so named because the system call “synchronizes” the file-system.
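As a concrete (if simplified) illustration unrelated to LTFS itself, the following C sketch shows the POSIX side of this: fsync() flushes one open file's buffers, while sync() asks the kernel to flush all dirty file-system buffers. On Windows the per-file equivalent is FlushFileBuffers().

```c
/* Illustrative only: explicitly flushing buffered writes on a POSIX system.
 * fsync() flushes one file's data and metadata; sync() asks the kernel to
 * flush all dirty file-system buffers. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char msg[] = "important record\n";
    if (write(fd, msg, sizeof msg - 1) < 0)
        perror("write");

    if (fsync(fd) < 0)          /* flush this file's buffers to the media */
        perror("fsync");
    close(fd);

    sync();                     /* request a flush of all mounted file-systems */
    return 0;
}
```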

This blog entry outlines the sync() behavior in LTFS and the rationale behind the associated design choices. In this post, comments that reference the sync() system call apply equally to the FlushFileBuffers() call in LTFS-SDE for Windows.

The Linear Tape File System – Single Drive Edition (LTFS-SDE) maintains the working copy of the Index for the LTFS Volume in main memory while the file-system is mounted. Separately, buffers are maintained for each open file. These file buffers are used to coalesce individual write calls into block-sized write operations to the media, rather than performing many very small writes. This write-caching dramatically increases write performance of the file-system. The metadata caching is briefly described in an earlier blog post. Write-buffer caching in LTFS-SDE will be discussed in a future post. Both the metadata cache (a.k.a. the LTFS Index) and the write-buffers are at risk in the event of a system crash or power-loss.
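To make the coalescing idea concrete, here is a minimal sketch (my own illustration, not LTFS source) of buffering small writes into a fixed block size before touching the media; media_write_block() and the block size are hypothetical stand-ins for the real drive interface.

```c
/* Minimal sketch of write coalescing: small writes accumulate in a
 * block-sized buffer and are written to the media only when the buffer
 * fills (or is explicitly flushed). media_write_block() is a hypothetical
 * stand-in for the real tape-drive write path. */
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (512 * 1024)          /* illustrative block size */

struct write_buffer {
    unsigned char data[BLOCK_SIZE];
    size_t used;
};

extern void media_write_block(const void *buf, size_t len);  /* hypothetical */

static void buffer_flush(struct write_buffer *wb)
{
    if (wb->used > 0) {
        media_write_block(wb->data, wb->used);
        wb->used = 0;
    }
}

/* Accept an arbitrarily small application write; only full blocks reach the media. */
static void buffer_write(struct write_buffer *wb, const void *src, size_t len)
{
    const unsigned char *p = src;
    while (len > 0) {
        size_t room = BLOCK_SIZE - wb->used;
        size_t n = len < room ? len : room;
        memcpy(wb->data + wb->used, p, n);
        wb->used += n;
        p += n;
        len -= n;
        if (wb->used == BLOCK_SIZE)
            buffer_flush(wb);
    }
}
```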

LTFS also has an additional concern: the frequency of sync() operations. A file-system on random-access media such as flash or a hard drive can update the on-media Index in place. This means that frequent updates to the on-media Index will overwrite older versions and therefore will usually occupy the same physical space on the media. With sequential media there is a real risk of consuming all of the media space with metadata Indexes, leaving no room for data storage. Careful design of our sync() operations in LTFS has resulted in a balanced and safe range of sync() functionality that can be tweaked to meet differing user scenarios.

LTFS v1.0 and v1.1

The first two releases (v1.0 and v1.1) of the Linear Tape File System took a very simple approach to synchronizing the metadata cache and write-buffers. In these releases, the Index metadata is only flushed to the media when a partition change occurs, or when the Volume is unmounted. In typical operation (without a data placement policy defined) there will never be a partition change to the Index Partition. In effect, an Index will only be flushed during unmount. (For the purpose of this blog post I will ignore data placement policies to simplify the discussion.)

Since LTFS v1.0 and v1.1 will only flush the Index during unmount, if the system loses power the user is exposed to potentially losing all of the data changes made to the Volume during the entire mount session. This data loss would include any files written to the LTFS Volume, any changes made to files already on the LTFS Volume, and any metadata updates such as file deletes, file renames, file moves, and timestamp updates.

Traditional file-systems

Traditional file-systems (NTFS, HFS+, ext3, etc.) maintain a similar in-memory working cache of parts of the file-system Index and write-buffers. These caches and write-buffers help improve file-system performance. At any point with these traditional systems there is the potential for the same types of data loss described above if the in-memory file-system Index has not been flushed to the storage media. This is the primary reason for requiring an explicit unmount operation before disconnecting a hard drive from a running system. In Linux, the unmount operation is performed using the “umount” command. In Windows, the unmount operation is performed using the “Safely Eject Device” GUI action.

In modern systems a background kernel thread takes periodic action to reduce this risk of data loss by triggering a flush of the in-memory file-system Index cache. This periodic action is typically performed every 5 seconds and flushes any changes to mounted file-systems. As a result, the window of risk with traditional file-systems is limited to the last 5 seconds of operation.

LTFS v1.2

With LTFS-SDE v1.2 we added two new sync() behaviors to help reduce the risk of data loss. These behaviors have different semantics and are appropriate in different user scenarios. The new behaviors are:

  • Time-based sync, and
  • Sync-on-close.

“Time-based sync” is logically equivalent to the traditional operating-system-driven sync() operation. That is, the file-system flushes the in-memory Index to the media periodically, with a defined time interval between flushes. When an Index flush is scheduled, the metadata is only written to the media if there has been some change since the preceding flush. The sequential nature of tape prompted me to set the default time period for time-based sync at 5 minutes. Thus the user is exposed to data loss for only the past 5 minutes of operation. On balance I felt that 5 minutes was a reasonable compromise between “wasting” a large amount of tape space on frequent Index updates and the time period during which the user is exposed to data loss. The default time period can be overridden at mount time by users who have particular requirements.
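Conceptually, time-based sync amounts to a background thread that wakes on the configured interval and writes an Index only if something has changed. The sketch below is a simplified illustration rather than the actual LTFS implementation; write_index_to_media(), the dirty flag, and the interval variable are all hypothetical.

```c
/* Simplified illustration of time-based sync: a background thread wakes
 * every sync_interval seconds and writes an Index only if the in-memory
 * Index has changed since the last flush. write_index_to_media() and
 * index_dirty are hypothetical stand-ins. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static atomic_bool index_dirty = false;   /* set by any metadata update      */
static int sync_interval = 300;           /* default: 5 minutes, overridable */

extern void write_index_to_media(void);   /* hypothetical */

static void *timed_sync_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(sync_interval);
        /* Only spend tape space on an Index if something actually changed. */
        if (atomic_exchange(&index_dirty, false))
            write_index_to_media();
    }
    return NULL;
}

/* Start the background sync thread (called once at mount time). */
static pthread_t sync_tid;
static void start_timed_sync(void)
{
    pthread_create(&sync_tid, NULL, timed_sync_thread, NULL);
}
```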

“Sync-on-close” provides a sync() semantic where, each time a file is closed, the in-memory Index is flushed to the media if the Index has been updated. This ensures that each time the user finishes operating on a file all changes in the file-system are flushed to the media. Some metadata changes do not involve a close() operation; with these metadata operations there is a risk of losing the result of the operation in the event of power-loss. The other members of the LTFS development team and I felt that the metadata operations that remain at risk can tolerate higher risk than file content.
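Sync-on-close can be pictured as a check in the file-system's close path, as in the following sketch. Again, this is an illustration under assumptions rather than the LTFS code; the helper names mirror the previous example and are hypothetical.

```c
/* Sketch of sync-on-close: when the application closes a file, flush its
 * remaining buffered data and, if the in-memory Index has changed, write a
 * new Index to the media. All names are hypothetical stand-ins. */
#include <stdatomic.h>
#include <stdbool.h>

extern atomic_bool index_dirty;                 /* set by metadata updates  */
extern void flush_file_buffers(int handle);     /* hypothetical             */
extern void write_index_to_media(void);         /* hypothetical             */

static int ltfs_style_close(int handle)
{
    flush_file_buffers(handle);                 /* push remaining file data */

    /* Sync-on-close: only write an Index if something actually changed.    */
    if (atomic_exchange(&index_dirty, false))
        write_index_to_media();

    return 0;
}
```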

In LTFS-SDE v1.2 the user can specify “time-based sync” or “sync-on-close” at file-system mount time. The desired sync() behavior can also be configured system-wide in the ltfs configuration file.

If “time-based sync” or “sync-on-close” is enabled, the file-system will still perform a sync() during unmount operations.

Manual sync()

LTFS-SDE v1.2 also exposes the ability for the user (or an application) to manually trigger a file system sync() by writing to the ltfs.sync extended attribute. If this sync() is triggered while there are open file handles there is the potential that the Index written will contain partially written files.
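From a program on Linux this can be done with the standard extended-attribute interface, roughly as sketched below. Treat it as an assumption about usage: the mount point is made up, and the exact attribute name or namespace prefix should be checked against the LTFS documentation for your platform.

```c
/* Illustrative manual sync trigger: write to the ltfs.sync extended
 * attribute on the mount point. The exact attribute name / namespace may
 * differ by platform; treat this as a sketch, not a reference. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *mountpoint = "/mnt/ltfs";       /* hypothetical mount point */
    const char *value = "1";

    if (setxattr(mountpoint, "ltfs.sync", value, strlen(value), 0) != 0) {
        perror("setxattr(ltfs.sync)");
        return 1;
    }
    return 0;
}
```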

All versions of LTFS-SDE to date ignore sync() system calls from user-space. This is a limitation of the FUSE library. The FUSE developers are rightfully concerned about a FUSE-based file-system taking an arbitrarily long time to complete a sync(), during which the whole system would be blocked. The ltfs.sync extended attribute and the time-based sync are approaches to providing traditional sync() functionality without FUSE support for the sync() system call.

Potential user concerns

In the situation where “time-based sync” is enabled and a file copy is performed that exceeds the sync timeout value, an Index may be written to the media in the middle of the copy operation. Assume that Index generation 4 was written before the copy started and Index generation 5 is written in the middle of the file copy operation. Index generation 5 will contain an entry for the file that is in the middle of the copy operation, and only the data that has landed on the media will be referenced in this Index. This means that if the file copy is 50% done when the sync() occurs then the Index will only reference the first half of the file being copied. If, at a later date, the user chooses to roll back the Volume to Index generation 5 then the Volume will only contain the first half of the file after the roll-back. This may be perceived by the user as a data-loss scenario, but technically the roll-back has not resulted in a loss of data. Index generation 5 will correctly represent the state of the LTFS Volume at the point in time when the Index was written.

We explored ways to remove this perceived problem, but none of the potential changes would eliminate the issue. In fact, this perception of partial writes as data loss exists with any sync() operation that is performed outside of unmount processing. For example, consider a Volume that is mounted with only sync-on-close enabled, and two files (A.txt and B.txt) that are simultaneously copied to the filesystem from two different copy commands. If A.txt is significantly smaller than B.txt then the close() operation on A.txt will trigger a sync operation. Assume that this sync() operation produces Index generation 10 on the Volume. If a user then rolls back the volume to generation 10 they will find that only part of file B.txt exists on the Volume.

A potential solution would be to track file handles that are open as a side-effect of a create() call. If these handles were tracked, the file system could in theory defer a sync operation until those handles were closed. This approach would reduce the potential for partial files on roll-back. However, it would not cover cases where a copy operation is performed on a filename that already exists in the filesystem. In that case applications typically use open() followed by either a truncate or a seek to the beginning of the file, and simply start overwriting the file contents. In these cases we would still get partial writes on the Volume if a roll-back is performed.
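For example, a generic copy tool that overwrites an existing destination typically does something like the following (an illustration, not any particular tool's code). Because the destination already exists, the file-system sees an open of an existing file followed by a truncate rather than the creation of a new file, so tracking create()-opened handles would not help in this case.

```c
/* Generic illustration of the overwrite pattern: when the destination file
 * already exists, O_CREAT has no effect and the file-system sees an open of
 * the existing file plus a truncate to zero length, after which the contents
 * are rewritten from the start. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int copy_overwrite(const char *dst, const char *buf, size_t len)
{
    int fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    ssize_t n = write(fd, buf, len);    /* rewrite contents from offset 0 */
    close(fd);
    return n < 0 ? -1 : 0;
}
```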

All of these partial-write scenarios exist in traditional file-systems that operate on flash and hard-drive media. However, the scenarios are hidden on traditional media because the Index is updated in place rather than recorded as a sequence of Index snapshots as happens with LTFS.

My team explored a number of potential changes to address the concern over partially written files appearing after a file-system roll-back. We did not find an approach that addressed all partially written file scenarios without introducing significant complexity into the file-system. My judgement was that the risk from the added complexity did not justify the development effort.

In a future version of the LTFS Specification I hope to extend the LTFS Format to keep track of the number of open files when an Index is written to the media. If this open-file count is written as part of the LTFS Index it could be exposed in the roll-back point list. This change would not remove the potential for partially written files, but it would give the user the information they need to understand which Index generations may contain partial files. The user can then make an informed choice about which generation to select. I believe that this extension is the best way to address concerns over partially written files.
