misc document updates
trapexit committed Feb 1, 2017
1 parent c25974d commit 9cc9bb9
Showing 3 changed files with 59 additions and 75 deletions.
39 changes: 23 additions & 16 deletions README.md
@@ -49,16 +49,16 @@ The srcmounts (source mounts) argument is a colon (':') delimited list of paths
To make it easier to include multiple source mounts mergerfs supports [globbing](http://linux.die.net/man/7/glob). **The globbing tokens MUST be escaped when used via the shell; otherwise the shell itself will expand them.**

```
$ mergerfs -o defaults,allow_other /mnt/disk\*:/mnt/cdrom /media/drives
$ mergerfs -o defaults,allow_other,use_ino /mnt/disk\*:/mnt/cdrom /media/drives
```

The above line will use all mount points in /mnt prefixed with **disk** and the **cdrom**.

To have the pool mounted at boot or otherwise accessible from related tools use **/etc/fstab**.

```
# <file system> <mount point> <type> <options> <dump> <pass>
/mnt/disk*:/mnt/cdrom /media/drives fuse.mergerfs defaults,allow_other 0 0
# <file system> <mount point> <type> <options> <dump> <pass>
/mnt/disk*:/mnt/cdrom /media/drives fuse.mergerfs defaults,allow_other,use_ino 0 0
```

**NOTE:** the globbing is done at mount or xattr update time (see below). If a new directory is added matching the glob after the fact it will not be automatically included.
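
The one-time expansion described in the note can be illustrated with a small Python sketch. This is not mergerfs code: `expand_srcmounts` is a hypothetical helper, and keeping a non-matching pattern as a literal path is an assumption made for the example. It only shows why a directory created after the expansion is absent from the earlier result.

```python
import glob


def expand_srcmounts(srcmounts):
    """Split a colon-delimited srcmounts string and expand glob patterns once."""
    paths = []
    for pattern in srcmounts.split(":"):
        matches = sorted(glob.glob(pattern))
        # Assumption for this sketch: a pattern with no matches is kept as-is.
        paths.extend(matches if matches else [pattern])
    return paths
```

Calling `expand_srcmounts("/mnt/disk*:/mnt/cdrom")` returns a plain list that is fixed at call time; a `/mnt/disk3` created later only appears if the function is called again, which mirrors the need to remount or update via xattr.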
@@ -97,7 +97,7 @@ Due to FUSE limitations **ioctl** behaves differently if its acting on a directo
| ff (first found) | Given the order of the drives, as defined at mount time or when configured via xattr interface, act on the first one found. For **create** category it will exclude readonly drives and those with free space less than **minfreespace** (unless there is no other option). |
| lfs (least free space) | Pick the drive with the least available free space. For **create** category it will exclude readonly drives and those with free space less than **minfreespace**. Falls back to **mfs**. |
| lus (least used space) | Pick the drive with the least used space. For **create** category it will exclude readonly drives and those with free space less than **minfreespace**. Falls back to **mfs**. |
| mfs (most free space) | Pick the drive with the most available free space. For **create** category it will exclude readonly drives and those with free space less than **minfreespace**. Falls back to **ff**. |
| mfs (most free space) | Pick the drive with the most available free space. For **create** category it will exclude readonly drives. Falls back to **ff**. |
| newest (newest file) | Pick the file / directory with the largest mtime. For **create** category it will exclude readonly drives and those with free space less than **minfreespace** (unless there is no other option). |
| rand (random) | Calls **all** and then randomizes. |
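
As a rough illustration of how the **mfs** and **ff** rows read for the **create** category, here is a Python sketch. It is not taken from the mergerfs source: the `Drive` record and both function names are hypothetical, "unless there is no other option" is interpreted as dropping the **minfreespace** filter, and the "falls back to" chaining between policies is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    path: str
    free_bytes: int
    readonly: bool = False


def pick_mfs_create(drives):
    """mfs: the writable drive with the most free space, or None."""
    writable = [d for d in drives if not d.readonly]
    if not writable:
        return None
    return max(writable, key=lambda d: d.free_bytes)


def pick_ff_create(drives, minfreespace):
    """ff: first writable drive with enough free space, in the given order."""
    writable = [d for d in drives if not d.readonly]
    roomy = [d for d in writable if d.free_bytes >= minfreespace]
    # Assumed reading of "unless there is no other option":
    # ignore minfreespace when no writable drive satisfies it.
    for group in (roomy, writable):
        if group:
            return group[0]
    return None
```

With drives listed as `disk1` (10 bytes free), `disk2` (50, readonly) and `disk3` (30), `pick_mfs_create` selects `disk3`, while `pick_ff_create` with a `minfreespace` of 20 also skips `disk1` and lands on `disk3`.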

@@ -310,7 +310,7 @@ A B C
* The recommended options are **defaults,allow_other,direct_io,use_ino**.
* Run mergerfs as `root` unless you're merging paths which are owned by the same user; otherwise strange permission issues may arise.
* https://github.com/trapexit/backup-and-recovery-howtos : A set of guides / howtos on creating a data storage system, backing it up, maintaining it, and recovering from failure.
* If you don't see some directories / files you expect in a merged point be sure the user has permission to all the underlying directories. If `/drive0/a` has is owned by `root:root` with ACLs set to `0700` and `/drive1/a` is `root:root` and `0755` you'll see only `/drive1/a`. Use `mergerfs.fsck` to audit the drive for out of sync permissions.
* If you don't see some directories / files you expect in a merged point be sure the user has permission to all the underlying directories. Use `mergerfs.fsck` to audit the drive for out of sync permissions.
* Do *not* use `direct_io` if you expect applications (such as rtorrent) to [mmap](http://linux.die.net/man/2/mmap) files. It is not currently supported in FUSE w/ `direct_io` enabled.
* Since POSIX gives you only error or success on calls it's difficult to determine the proper behavior when applying an action to multiple targets. **mergerfs** will return an error only if all attempts of an action fail. Any success will lead to a success being returned. This means, however, that some odd situations may arise.
* Remember that some policies mixed with some functions may result in strange behaviors. Not that some of these behaviors and race conditions couldn't happen outside **mergerfs**, but they are far more likely to occur on account of attempting to merge together multiple sources of data which could be out of sync due to the different policies.
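
The "error only if every attempt fails" rule from the list above can be sketched in Python. This is an illustration, not mergerfs code: `merge_results` is a hypothetical name, and both the choice to report the first error and the `ENOENT` result for an empty list are assumptions, since the text does not say which error is returned.

```python
import errno


def merge_results(results):
    """Combine per-target results (0 = success, otherwise an errno value)."""
    if not results:
        return errno.ENOENT  # assumption: no targets means nothing was found
    if any(r == 0 for r in results):
        return 0  # any single success is reported as success
    return results[0]  # assumption: report the first error seen
```

For example, `merge_results([errno.EACCES, 0, errno.EROFS])` yields `0` even though two of the three targets failed, which is the kind of odd situation the text warns about.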
@@ -325,6 +325,12 @@ Use the `direct_io` option as described above. Due to what mergerfs is doing the

Since enabling `direct_io` disables `mmap` this is not an ideal situation however write speeds should be increased and there are some tweaks being developed which may help in minimizing the extra caching.

#### NFS clients don't work

Some NFS clients appear to fail when a mergerfs mount is exported. Kodi in particular seems to have issues.

Try enabling the `use_ino` option. Some have reported that it fixes the issue.

#### rtorrent fails with ENODEV (No such device)

Be sure to turn off `direct_io`. rtorrent and some other applications use [mmap](http://linux.die.net/man/2/mmap) to read and write to files and offer no fallback to traditional methods. FUSE does not currently support mmap while using `direct_io`. There will be a performance penalty on writes with `direct_io` off as well as the problem of double caching but it's the only way to get such applications to work. If the performance loss is too high for other apps you can mount mergerfs twice: once with `direct_io` enabled and once without it.
@@ -431,33 +437,34 @@ Yes. It will be represented immediately in the pool as the policies would descri

Please reread the sections above about policies, path preserving, and the **moveonenospc** option. If the policy is path preserving and the drive it would pick is almost full then writing the file may fill the drive and result in ENOSPC errors. That is expected with those settings. If you don't want that: enable **moveonenospc** and don't use a path preserving policy.

#### How are inodes calculated?
#### Can mergerfs mounts be exported over NFS?

mergerfs-inode = (original-inode | (device-id << 32))
Yes. Some clients (Kodi) have issues but users have found that enabling the `use_ino` option often addresses the problem.

While `ino_t` is 64 bits only a few filesystems use more than 32. Similarly, while `dev_t` is also 64 bits it was traditionally 16 bits. Bitwise or'ing them together should work most of the time. While totally unique inodes are preferred the overhead which would be needed does not seem to outweighted by the benefits.
#### Can mergerfs mounts be exported over Samba / SMB?

#### It's mentioned that there are some security issues with mhddfs. What are they? How does mergerfs address them?
Yes.

[mhddfs](https://github.com/trapexit/mhddfs) tries to handle being run as **root** by calling [getuid()](https://github.com/trapexit/mhddfs/blob/cae96e6251dd91e2bdc24800b4a18a74044f6672/src/main.c#L319) and if it returns **0** then it will [chown](http://linux.die.net/man/1/chown) the file. Not only is that a race condition but it doesn't handle many other situations. Rather than attempting to simulate POSIX ACL behaviors the proper behavior is to use [seteuid](http://linux.die.net/man/2/seteuid) and [setegid](http://linux.die.net/man/2/setegid), become the user making the original call and perform the action as them. This is how [mergerfs](https://github.com/trapexit/mergerfs) handles things.
#### How are inodes calculated?

If you are familiar with POSIX standards you'll know that this behavior poses a problem. **seteuid** and **setegid** affect the whole process and **libfuse** is multithreaded by default. We'd need to lock access to **seteuid** and **setegid** with a mutex so that the several threads aren't stepping on one another and files end up with weird permissions and ownership. However, with lots of calls the contention on that mutex would be extremely high. Thankfully on Linux and macOS there is a better solution.
mergerfs-inode = (original-inode | (device-id << 32))

macOS has a [non-portable pthread extension](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/pthread_setugid_np.2.html) for per-thread user and group impersonation.
While `ino_t` is 64 bits only a few filesystems use more than 32. Similarly, while `dev_t` is also 64 bits it was traditionally 16 bits. Bitwise or'ing them together should work most of the time. While totally unique inodes are preferred, the benefits do not seem to outweigh the overhead that would be needed.
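
The formula can be tried out with a short Python sketch. The function name and the explicit 64-bit mask are assumptions added for illustration (Python integers are unbounded, so the mask stands in for a 64-bit `ino_t`); only the OR-and-shift expression comes from the text above.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit ino_t


def mergerfs_inode(original_inode, device_id):
    """Combine a source inode and device id as in the formula above."""
    return (original_inode | (device_id << 32)) & MASK64


# The same inode number on two devices maps to two different values.
a = mergerfs_inode(42, 1)
b = mergerfs_inode(42, 2)

# An inode that uses more than 32 bits can collide with another device's inode.
c = mergerfs_inode((1 << 32) | 42, 0)
```

Here `a` and `b` differ, while `c` equals `a`, which shows why the scheme works "most of the time" rather than always.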

Linux does not support [pthread_setugid_np](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/pthread_setugid_np.2.html) but user and group IDs are a per-thread attribute though documentation on that fact or how to manipulate them is not well distributed. From the **4.00** release of the Linux man-pages project for [setuid](http://man7.org/linux/man-pages/man2/setuid.2.html).
#### It's mentioned that there are some security issues with mhddfs. What are they? How does mergerfs address them?

> At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setuid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
[mhddfs](https://github.com/trapexit/mhddfs) manages running as **root** by calling [getuid()](https://github.com/trapexit/mhddfs/blob/cae96e6251dd91e2bdc24800b4a18a74044f6672/src/main.c#L319) and if it returns **0** then it will [chown](http://linux.die.net/man/1/chown) the file. Not only is that a race condition but it doesn't handle many other situations. Rather than attempting to simulate POSIX ACL behavior the proper way to manage this is to use [seteuid](http://linux.die.net/man/2/seteuid) and [setegid](http://linux.die.net/man/2/setegid), in effect becoming the user making the original call, and perform the action as them. This is what mergerfs does.

As it turns out the setreuid syscalls apply only to the thread. GLIBC hides this away using RT signals to inform all threads to change credentials. Taking after **Samba**, mergerfs uses **syscall(SYS_setreuid,...)** to set the callers credentials for that thread only. Jumping back to **root** as necessary should escalated privileges be needed (for instance: to clone paths between drives).
On Linux the setreuid syscalls apply only to the calling thread. GLIBC hides this by using realtime signals to inform all threads to change credentials. Taking after **Samba**, mergerfs uses **syscall(SYS_setreuid,...)** to set the caller's credentials for that thread only, jumping back to **root** as necessary should escalated privileges be needed (for instance: to clone paths between drives).

For non-Linux systems mergerfs uses a read-write lock and changes credentials only when necessary. If multiple threads are to be user X then only the first one will need to change the processes credentials. So long as the other threads need to be user X they will take a readlock allow multiple threads to share the credentials. Once a request comes in to run as user Y that thread will attempt a write lock and change to Y's credentials when it can. If the ability to give writers priority is supported then that flag will be used so threads trying to change credentials don't starve. This isn't the best solution but should work reasonably well. As new platforms are supported if they offer per thread credentials those APIs will be adopted.
For non-Linux systems mergerfs uses a read-write lock and changes credentials only when necessary. If multiple threads are to be user X then only the first one will need to change the process's credentials. So long as the other threads need to be user X they will take a read lock, allowing multiple threads to share the credentials. Once a request comes in to run as user Y that thread will attempt a write lock and change to Y's credentials when it can. If the ability to give writers priority is supported then that flag will be used so threads trying to change credentials don't starve. This isn't the best solution but should work reasonably well assuming there are few users.
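
The sharing behavior described for non-Linux systems can be modeled with a small Python simulation. This is a sketch under stated assumptions, not the mergerfs implementation: `CredentialGate` is a hypothetical class, it uses a condition variable instead of a pthread read-write lock, it never touches real process credentials, and it does not implement writer priority.

```python
import threading


class CredentialGate:
    """Let threads that need the same uid proceed together; others wait."""

    def __init__(self):
        self._cond = threading.Condition()
        self._current_uid = None  # uid the simulated process currently runs as
        self._holders = 0         # threads currently relying on that uid
        self.switches = 0         # number of credential changes performed

    def acquire(self, uid):
        with self._cond:
            # Wait while other threads still depend on a different uid.
            while self._holders > 0 and self._current_uid != uid:
                self._cond.wait()
            if self._current_uid != uid:
                # A real implementation would change process credentials here.
                self._current_uid = uid
                self.switches += 1
            self._holders += 1

    def release(self):
        with self._cond:
            self._holders -= 1
            if self._holders == 0:
                # Wake threads waiting to switch to another uid.
                self._cond.notify_all()
```

Two back-to-back `acquire(1000)` calls cost a single switch, and a later `acquire(1001)` proceeds only after both holders have called `release()`. With many distinct users the gate would switch constantly, which matches the caveat about few users.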

# SUPPORT

#### Issues with the software
* github.com: https://github.com/trapexit/mergerfs/issues
* email: trapexit@spawn.link
* twitter: https://twitter.com/_trapexit

#### Support development
* Gratipay: https://gratipay.com/~trapexit
93 changes: 34 additions & 59 deletions man/mergerfs.1
@@ -104,7 +104,7 @@ the shell itself will expand it.\f[]
.IP
.nf
\f[C]
$\ mergerfs\ \-o\ defaults,allow_other\ /mnt/disk\\*:/mnt/cdrom\ /media/drives
$\ mergerfs\ \-o\ defaults,allow_other,use_ino\ /mnt/disk\\*:/mnt/cdrom\ /media/drives
\f[]
.fi
.PP
@@ -116,8 +116,8 @@ tools use \f[B]/etc/fstab\f[].
.IP
.nf
\f[C]
#\ <file\ system>\ \ \ \ \ \ \ \ <mount\ point>\ \ <type>\ \ \ \ \ \ \ \ \ <options>\ \ \ \ \ \ \ \ \ \ \ \ \ <dump>\ \ <pass>
/mnt/disk*:/mnt/cdrom\ \ /media/drives\ \ fuse.mergerfs\ \ defaults,allow_other\ \ 0\ \ \ \ \ \ \ 0
#\ <file\ system>\ \ \ \ \ \ \ \ <mount\ point>\ \ <type>\ \ \ \ \ \ \ \ \ <options>\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ <dump>\ \ <pass>
/mnt/disk*:/mnt/cdrom\ \ /media/drives\ \ fuse.mergerfs\ \ defaults,allow_other,use_ino\ \ 0\ \ \ \ \ \ \ 0
\f[]
.fi
.PP
@@ -302,8 +302,7 @@ T{
mfs (most free space)
T}@T{
Pick the drive with the most available free space.
For \f[B]create\f[] category it will exclude readonly drives and those
with free space less than \f[B]minfreespace\f[].
For \f[B]create\f[] category it will exclude readonly drives.
Falls back to \f[B]ff\f[].
T}
T{
@@ -705,9 +704,6 @@ maintaining it, and recovering from failure.
.IP \[bu] 2
If you don\[aq]t see some directories / files you expect in a merged
point be sure the user has permission to all the underlying directories.
If \f[C]/drive0/a\f[] has is owned by \f[C]root:root\f[] with ACLs set
to \f[C]0700\f[] and \f[C]/drive1/a\f[] is \f[C]root:root\f[] and
\f[C]0755\f[] you\[aq]ll see only \f[C]/drive1/a\f[].
Use \f[C]mergerfs.fsck\f[] to audit the drive for out of sync
permissions.
.IP \[bu] 2
@@ -766,6 +762,13 @@ Since enabling \f[C]direct_io\f[] disables \f[C]mmap\f[] this is not an
ideal situation however write speeds should be increased and there are
some tweaks being developed which may help in minimizing the extra
caching.
.SS NFS clients don\[aq]t work
.PP
Some NFS clients appear to fail when a mergerfs mount is exported.
Kodi in particular seems to have issues.
.PP
Try enabling the \f[C]use_ino\f[] option.
Some have reported that it fixes the issue.
.SS rtorrent fails with ENODEV (No such device)
.PP
Be sure to turn off \f[C]direct_io\f[].
@@ -984,6 +987,14 @@ drive and receive ENOSPC errors.
That is expected with those settings.
If you don\[aq]t want that: enable \f[B]moveonenospc\f[] and don\[aq]t
use a path preserving policy.
.SS Can mergerfs mounts be exported over NFS?
.PP
Yes.
Some clients (Kodi) have issues but users have found that enabling the
\f[C]use_ino\f[] option often addresses the problem.
.SS Can mergerfs mounts be exported over Samba / SMB?
.PP
Yes.
.SS How are inodes calculated?
.PP
mergerfs\-inode = (original\-inode | (device\-id << 32))
@@ -997,59 +1008,22 @@ needed does not seem to outweighted by the benefits.
.SS It\[aq]s mentioned that there are some security issues with mhddfs.
What are they? How does mergerfs address them?
.PP
mhddfs (https://github.com/trapexit/mhddfs) tries to handle being run as
mhddfs (https://github.com/trapexit/mhddfs) manages running as
\f[B]root\f[] by calling
getuid() (https://github.com/trapexit/mhddfs/blob/cae96e6251dd91e2bdc24800b4a18a74044f6672/src/main.c#L319)
and if it returns \f[B]0\f[] then it will
chown (http://linux.die.net/man/1/chown) the file.
Not only is that a race condition but it doesn\[aq]t handle many other
situations.
Rather than attempting to simulate POSIX ACL behaviors the proper
behavior is to use seteuid (http://linux.die.net/man/2/seteuid) and
setegid (http://linux.die.net/man/2/setegid), become the user making the
original call and perform the action as them.
This is how mergerfs (https://github.com/trapexit/mergerfs) handles
things.
.PP
If you are familiar with POSIX standards you\[aq]ll know that this
behavior poses a problem.
\f[B]seteuid\f[] and \f[B]setegid\f[] affect the whole process and
\f[B]libfuse\f[] is multithreaded by default.
We\[aq]d need to lock access to \f[B]seteuid\f[] and \f[B]setegid\f[]
with a mutex so that the several threads aren\[aq]t stepping on one
another and files end up with weird permissions and ownership.
However, with lots of calls the contention on that mutex would be
extremely high.
Thankfully on Linux and macOS there is a better solution.
.PP
macOS has a non\-portable pthread
extension (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/pthread_setugid_np.2.html)
for per\-thread user and group impersonation.
.PP
Linux does not support
pthread_setugid_np (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/pthread_setugid_np.2.html)
but user and group IDs are a per\-thread attribute though documentation
on that fact or how to manipulate them is not well distributed.
From the \f[B]4.00\f[] release of the Linux man\-pages project for
setuid (http://man7.org/linux/man-pages/man2/setuid.2.html).
.RS
.PP
At the kernel level, user IDs and group IDs are a per\-thread attribute.
However, POSIX requires that all threads in a process share the same
credentials.
The NPTL threading implementation handles the POSIX requirements by
providing wrapper functions for the various system calls that change
process UIDs and GIDs.
These wrapper functions (including the one for setuid()) employ a
signal\-based technique to ensure that when one thread changes
credentials, all of the other threads in the process also change their
credentials.
For details, see nptl(7).
.RE
.PP
As it turns out the setreuid syscalls apply only to the thread.
GLIBC hides this away using RT signals to inform all threads to change
credentials.
Rather than attempting to simulate POSIX ACL behavior the proper way to
manage this is to use seteuid (http://linux.die.net/man/2/seteuid) and
setegid (http://linux.die.net/man/2/setegid), in effect becoming the
user making the original call, and perform the action as them.
This is what mergerfs does.
.PP
In Linux setreuid syscalls apply only to the thread.
GLIBC hides this away by using realtime signals to inform all threads to
change credentials.
Taking after \f[B]Samba\f[], mergerfs uses
\f[B]syscall(SYS_setreuid,...)\f[] to set the caller\[aq]s credentials for
that thread only.
@@ -1061,20 +1035,21 @@ credentials only when necessary.
If multiple threads are to be user X then only the first one will need
to change the process\[aq]s credentials.
So long as the other threads need to be user X they will take a readlock
allow multiple threads to share the credentials.
allowing multiple threads to share the credentials.
Once a request comes in to run as user Y that thread will attempt a
write lock and change to Y\[aq]s credentials when it can.
If the ability to give writers priority is supported then that flag will
be used so threads trying to change credentials don\[aq]t starve.
This isn\[aq]t the best solution but should work reasonably well.
As new platforms are supported if they offer per thread credentials
those APIs will be adopted.
This isn\[aq]t the best solution but should work reasonably well
assuming there are few users.
.SH SUPPORT
.SS Issues with the software
.IP \[bu] 2
github.com: https://github.com/trapexit/mergerfs/issues
.IP \[bu] 2
email: trapexit\@spawn.link
.IP \[bu] 2
twitter: https://twitter.com/_trapexit
.SS Support development
.IP \[bu] 2
Gratipay: https://gratipay.com/~trapexit