

Battle testing data integrity verification with ZFS, Btrfs and mdadm+dm-integrity
In this article I share the results of a home-lab experiment in which I threw some different problems at ZFS, Btrfs and mdadm+dm-integrity in a RAID-5 setup.
zfs  btrfs  linux  storage  myths  bsdnow  unix  bsd  raid  filesystem  choices  research  thanks 
14 days ago by xer0x
bertbaron/btrdedup: BTRFS Deduplication tool
BTRFS Deduplication tool

Deduplication tool like bedup. I wrote it quite some time ago
because bedup had problems with my volume and the number of
snapshots (crashes, database corruption, etc.).

Btrdedup uses far fewer resources, especially when there are many
snapshots. The limitation is that it only deduplicates files that
start with the same content. It inspects the fragmentation of each
file before offering it to the kernel for deduplication (using
the btrfs deduplication ioctl), so data that is already shared is
not deduplicated again.

Btrdedup does not maintain state between runs. This makes it less
suitable for incremental deduplication. On the other hand it
makes the tool very robust and because of its efficiency in
detecting already deduplicated files it can easily be scheduled
to run once a month for example.


Download the latest release:


Make executable using: `chmod +x btrdedup`


Typically you want to run the program as root on the complete
mounted btrfs pool with a command like this:

``` {.shell}
nice -n 10 ./btrdedup /mnt 2>dedup.log
```

To run it in the background instead, redirect standard output as well:

``` {.shell}
nice -n 10 ./btrdedup /mnt >dedup.out 2>dedup.log &
```

The scanning phase may still take a long time depending on the
number of files. The most expensive part however, the
deduplication itself, is only called when necessary.

Btrdedup is very memory efficient and doesn't require a
database. It can be instructed to use even less memory by
providing the `-lowmem` option. This may require a few more
minutes, but it may also be faster because of reduced memory
management. Future versions might default to this option.

Use `btrdedup -h` for the full list of options.

Under the hood

Btrdedup works by first reading the file tree(s) into memory in an
efficient data structure. It then processes these files in three
passes:

- Pass 1: Read the fragmentation table for each file.

Sort the result on the offset of the first block

- Pass 2: Calculate the hash of the first block of each file.
Because the files are sorted on the first block offset, any
block is only loaded and hashed once.

Sort the result on the hash of the first block

- Pass 3: Files that have the first block in common are offered
for deduplication. The deduplication phase will first check
if blocks are already shared to only offer data for actual
deduplication if necessary.
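
The grouping idea behind passes 2 and 3 can be sketched in a few lines of Python. This is only an illustration, not btrdedup's code (the tool is written in Go): it skips the fragmentation table and the sort on physical offset, and the 4096-byte block size, the SHA-1 hash and the function names are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def first_block_hash(path, block_size=BLOCK_SIZE):
    """Return the SHA-1 hex digest of the first block of a file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(block_size)).hexdigest()


def candidate_groups(paths, block_size=BLOCK_SIZE):
    """Group files whose first block hashes to the same value.

    Only groups with at least two members are returned, since a file
    with a unique first block has nothing to be deduplicated against.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[first_block_hash(path, block_size)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Each returned group is a list of candidates only; a real tool would still have to check which ranges are already shared before asking the kernel to deduplicate anything.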

In lowmem mode, the output of each pass is written to an encoded
temporary text file, which is then sorted using the system's `sort`
command.
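
A rough Python sketch of sorting on disk rather than in memory is shown below. The record layout (one tab-separated `key` and `path` per line) is an assumption for the example and not btrdedup's actual encoding; keys are expected to be fixed-width strings so that textual order matches the intended order.

```python
import os
import subprocess
import tempfile


def sort_records_on_disk(records):
    """Sort (key, path) records using the system `sort` command.

    Records are written as tab-separated lines to a temporary file, so
    keys and paths must not contain tabs or newlines in this sketch.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for key, path in records:
            tmp.write(f"{key}\t{path}\n")
        tmp_name = tmp.name
    try:
        # LC_ALL=C gives plain byte-wise ordering, independent of the locale.
        env = dict(os.environ, LC_ALL="C")
        subprocess.run(["sort", "-o", tmp_name, tmp_name], check=True, env=env)
        with open(tmp_name) as f:
            return [tuple(line.rstrip("\n").split("\t", 1)) for line in f]
    finally:
        os.unlink(tmp_name)
```

The Python process never holds more than the final result here; a streaming consumer could read the sorted file line by line instead of building a list.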

Future improvements

The last pass still needs some improvements. Currently files with
the same hashcode for the first block are assumed to be equal to
the size of the smallest file. In the future the blocks should be
more thoroughly checked for duplicates, by comparing the hash
codes of all blocks.
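
One way such a block-by-block check could look is sketched below in Python. It is not part of btrdedup; the block size, the hash function and the function names are assumptions chosen for illustration.

```python
import hashlib


def block_hashes(path, block_size=4096):
    """Yield (length, digest) for each block of the file, in order."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield len(block), hashlib.sha1(block).digest()


def shared_prefix_bytes(path_a, path_b, block_size=4096):
    """Return how many leading bytes of the two files hash identically."""
    shared = 0
    for (len_a, hash_a), (len_b, hash_b) in zip(
        block_hashes(path_a, block_size), block_hashes(path_b, block_size)
    ):
        if len_a != len_b or hash_a != hash_b:
            break
        shared += len_a
    return shared
```

With a result like this, only the matching prefix would be offered for deduplication instead of assuming equality up to the size of the smallest file.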
btrfs  deduplication  language:go  linux  tools 
25 days ago by thedward
markfasheh/duperemove: Tools for deduping file systems
This README is for duperemove v0.11.


Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it
will hash their contents on a block by block basis and compare
those hashes to each other, finding and categorizing blocks that
match each other. When given the -d option, duperemove will
submit those extents for deduplication using the Linux kernel
extent-same ioctl.
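
To make the block-by-block comparison concrete, here is a minimal Python sketch of the hashing and matching step. It is not duperemove's implementation (duperemove is a C program and the sample run below reports murmur3); SHA-256 is used only because it ships with Python, and the function name is hypothetical. The 128 KiB default mirrors the block size shown in the sample output.

```python
import hashlib
from collections import defaultdict


def find_duplicate_blocks(paths, block_size=128 * 1024):
    """Map each block hash seen more than once to its (path, offset) list."""
    seen = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                seen[digest].append((path, offset))
                offset += len(block)
    return {digest: locs for digest, locs in seen.items() if len(locs) > 1}
```

The sketch only reports candidate blocks. Submitting them to the kernel, as the `-d` option does, is a separate step that is not shown here.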

Duperemove can store the hashes it computes in a 'hashfile'. If
given an existing hashfile, duperemove will only compute hashes
for those files which have changed since the last run. Thus you
can run duperemove repeatedly on your data as it changes, without
having to re-checksum unchanged data.
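
The idea of skipping unchanged files can be sketched with a small SQLite cache in Python. The table layout and the change test (modification time plus size) are assumptions for this example and do not describe duperemove's real hashfile format.

```python
import hashlib
import os
import sqlite3


def open_hashfile(db_path):
    """Open (or create) a tiny hash cache; the schema is illustrative only."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, digest TEXT)"
    )
    return db


def cached_digest(db, path):
    """Return (digest, rehashed) for path, hashing only if the file changed."""
    st = os.stat(path)
    row = db.execute(
        "SELECT mtime_ns, size, digest FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == st.st_mtime_ns and row[1] == st.st_size:
        return row[2], False  # unchanged since the last run: reuse the hash
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
        (path, st.st_mtime_ns, st.st_size, digest),
    )
    db.commit()
    return digest, True
```

A second call on an untouched file returns the stored digest without reading the file again, which is where the time saving on repeated runs comes from.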

Duperemove can also take input from the fdupes program.

See the duperemove man page for further details about running
duperemove.


The latest stable code (v0.11) can be found in the v0.11 branch.

Kernel: Duperemove needs a kernel version equal to or greater
than 3.13

Libraries: Duperemove uses glib2 and sqlite3.


Please see the FAQ section in the duperemove man page.

For bug reports and feature requests please use the GitHub issue
tracker.


Please see the examples section of the duperemove man page
for a complete set of usage examples, including hashfile usage.

A simple example, with program output

Duperemove takes a list of files and directories to scan for
dedupe. If a directory is specified, all regular files within it
will be scanned. Duperemove can also be told to recursively scan
directories with the '-r' switch. If '-h' is provided, duperemove
will print numbers in powers of 1024 (e.g., "128K").

Assume this arbitrary layout for the following examples.

    ├── dir1
    │   ├── file3
    │   ├── file4
    │   └── subdir1
    │       └── file5
    ├── file1
    └── file2

This will dedupe files 'file1' and 'file2':

duperemove -dh file1 file2

This does the same but adds any files in dir1 (file3 and file4):

duperemove -dh file1 file2 dir1

This will dedupe exactly the same as above but will recursively
walk dir1, thus adding file5.

duperemove -dhr file1 file2 dir1/

An actual run; output will differ according to duperemove version.

Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /btrfs/file1 [1/5] (20.00%)
csum: /btrfs/file2 [2/5] (40.00%)
csum: /btrfs/dir1/subdir1/file5 [3/5] (60.00%)
csum: /btrfs/dir1/file3 [4/5] (80.00%)
csum: /btrfs/dir1/file4 [5/5] (100.00%)
Total files: 5
Total hashes: 80
Loading only duplicated hashes from hashfile.
Hashing completed. Calculating duplicate extents - this may take some time.
Simple read and compare of file data found 3 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 512.0K with id 0971ffa6
Start Filename
512.0K "/btrfs/file1"
1.5M "/btrfs/dir1/file4"
Showing 2 identical extents of length 1.0M with id b34ffe8f
Start Filename
0.0 "/btrfs/dir1/file4"
0.0 "/btrfs/dir1/file3"
Showing 3 identical extents of length 1.5M with id f913dceb
Start Filename
0.0 "/btrfs/file2"
0.0 "/btrfs/dir1/file3"
0.0 "/btrfs/dir1/subdir1/file5"
Using 4 threads for dedupe phase
[0x147f4a0] Try to dedupe extents with id 0971ffa6
[0x147f770] Try to dedupe extents with id b34ffe8f
[0x147f680] Try to dedupe extents with id f913dceb
[0x147f4a0] Dedupe 1 extents (id: 0971ffa6) with target: (512.0K, 512.0K), "/btrfs/file1"
[0x147f770] Dedupe 1 extents (id: b34ffe8f) with target: (0.0, 1.0M), "/btrfs/dir1/file4"
[0x147f680] Dedupe 2 extents (id: f913dceb) with target: (0.0, 1.5M), "/btrfs/file2"
Kernel processed data (excludes target files): 4.5M
Comparison of extent info shows a net change in shared extents of: 5.5M

Links of interest

The duperemove wiki has both design and performance documentation.

has a growing assortment of regression tests.

Duperemove web page
deduplication  btrfs  extent-same  xfs 
25 days ago by thedward
g2p/bedup: Btrfs deduplication
Deduplication for Btrfs.

bedup looks for new and changed files, making sure that multiple
copies of identical files share space on disk. It integrates
deeply with btrfs so that scans are incremental and low-impact.


You need Python 3.3 or newer, and Linux 3.3 or newer. Linux 3.9.4
or newer is recommended, because it fixes a scanning bug and is
compatible with cross-volume deduplication.

This should get you started on Ubuntu 16.04:

sudo aptitude install python3-pip python3-dev python3-cffi libffi-dev build-essential git

This should get you started on earlier versions of Debian/Ubuntu:

sudo aptitude install python3-pip python3-dev libffi-dev build-essential git

This should get you started on Fedora:

yum install python3-pip python3-devel libffi-devel gcc git


On systems other than Ubuntu 16.04 you need to install CFFI:

pip3 install --user cffi

Option 1 (recommended): from a git clone

Enable submodules (this will pull headers from btrfs-progs)

git submodule update --init

Complete the installation. This will compile some code with CFFI
and pull the rest of our Python dependencies:

python3 setup.py install --user
cp -lt ~/bin ~/.local/bin/bedup

Option 2: from a PyPI release

pip3 install --user bedup
cp -lt ~/bin ~/.local/bin/bedup


bedup --help
bedup <command> --help

On Debian and Fedora, you may need to use `sudo -E ~/bin/bedup`
or install cffi and bedup as root
(bedup and its dependencies will get installed to /usr/local).

You'll see a list of supported commands.

- **scan** scans volumes to keep track of potentially
duplicated files.
- **dedup** runs scan, then deduplicates identical files.
- **show** shows btrfs filesystems and their tracking status.
- **dedup-files** takes a list of identical files and
deduplicates them.
- **find-new** reimplements the `btrfs subvolume find-new`
command with a few extra options.

To deduplicate all filesystems:

sudo bedup dedup

Unmounted or read-only filesystems are excluded if they aren't
listed on the command line. Filesystems can be referenced by UUID
or by a path in /dev:

sudo bedup dedup /dev/disk/by-label/Btrfs

Giving a subvolume path also works, and will include subvolumes
by default.

Since cross-subvolume deduplication requires Linux 3.6, users of
older kernels should use the `--no-crossvol` flag.
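
A wrapper script could choose the flag automatically from the running kernel's version. The Python sketch below is an illustration only: the helper names are hypothetical, and placing `--no-crossvol` directly after `dedup` is an assumption about the command-line layout.

```python
import platform
import re


def kernel_version(release=None):
    """Parse a release string such as '3.2.0-4-amd64' into (major, minor)."""
    release = platform.release() if release is None else release
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def dedup_command(release=None):
    """Build a bedup command line, adding --no-crossvol before Linux 3.6."""
    cmd = ["bedup", "dedup"]
    if kernel_version(release) < (3, 6):
        cmd.append("--no-crossvol")
    return cmd
```

Calling `dedup_command()` without arguments inspects the running kernel; passing a release string makes the decision easy to check for other systems.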


pip3 install --user pytest tox ipdb

To run the tests:

sudo python3 -m pytest -s bedup

To test compatibility and packaging as well:

GETROOT=/usr/bin/sudo tox

Run a style check on edited files:


Deduplication is implemented using a Btrfs feature that allows
for cloning data from one file to the other. The cloned ranges
become shared on disk, saving space.

File metadata isn't affected, and later changes to one file
won't affect the other (this is unlike hard-linking).
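
The contrast with hard-linking can be shown with a short Python experiment. A plain copy stands in for a cloned file here: it has the same independence from the original but does not share space on disk, so this illustrates the semantics only, not the Btrfs clone call itself.

```python
import os
import shutil
import tempfile


def demo_link_vs_copy():
    """Return what a hard link and a copy contain after the original changes."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original")
        linked = os.path.join(d, "linked")
        copied = os.path.join(d, "copied")
        with open(original, "w") as f:
            f.write("v1")
        os.link(original, linked)          # second name for the same inode
        shutil.copyfile(original, copied)  # separate inode with its own data
        with open(original, "w") as f:     # rewrite the original in place
            f.write("v2")
        with open(linked) as f_linked, open(copied) as f_copied:
            return f_linked.read(), f_copied.read()
```

The hard link follows the change because both names refer to one inode, while the independent file keeps its old contents.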

This approach doesn't require special kernel support, but it has
two downsides: locking has to be done in userspace, and there is
no way to free space within read-only (frozen) snapshots.


Scanning is done incrementally; the technique is similar to
`btrfs subvolume find-new`. You need an up-to-date kernel (such as
3.10, 3.9.4 or 3.4.47) to index all files;
earlier releases have a bug that causes find-new to end
prematurely. The fix can also be cherry-picked from the upstream commit.


Before cloning, we need to lock the files so that their contents
don't change from the time the data is compared to the time it
is cloned. Implementation note: This is done by setting the
immutable attribute on the file, scanning /proc to see if some
processes still have write access to the file (via preexisting
file descriptors or memory mappings), bailing if the file is in
write use. If all is well, the comparison and cloning steps can
proceed. The immutable attribute is then reverted.
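
The `/proc` scan described above can be approximated in Python as follows. This is a simplified sketch, not bedup's code: the function name is hypothetical, only file descriptors are inspected (memory mappings are ignored), and the immutable attribute is not set.

```python
import os

# Mask for the access-mode bits of the open flags (read-only is 0).
ACCESS_MODE_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def pids_with_write_access(path, proc_root="/proc"):
    """Return PIDs holding a writable file descriptor on `path`.

    Processes we are not allowed to inspect are skipped, so run this
    as root for a complete answer.
    """
    target = os.path.realpath(path)
    writers = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) != target:
                    continue
                with open(os.path.join(proc_root, entry, "fdinfo", fd)) as info:
                    # The "flags:" line holds the open flags in octal.
                    flags = next(
                        int(line.split()[1], 8)
                        for line in info
                        if line.startswith("flags:")
                    )
            except (OSError, StopIteration, ValueError):
                continue
            if flags & ACCESS_MODE_MASK != os.O_RDONLY:
                writers.add(int(entry))
    return sorted(writers)
```

An empty result suggests that no inspected process can still write through an existing descriptor; a non-empty one is the "bail out" case.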

This locking process might not be fool-proof in all cases; for
example a malicious application might manage to bypass it, which
would allow it to change the contents of files it doesn't have
access to.

There is also a small time window when an application will get
permission errors, if it tries to get write access to a file we
have already started to deduplicate.

Finally, a system crash at the wrong time could leave some files
immutable. They will be reported at the next run; fix them using
the `chattr -i` command.


The clone call is considered a write operation and won't work on
read-only snapshots.

Before Linux 3.6, the clone call didn't work across subvolumes.


Before Linux 3.9, defragmentation could break copy-on-write
sharing, which made it inadvisable when snapshots or
deduplication are used. Btrfs defragmentation has to be
explicitly requested (or background defragmentation enabled), so
this generally shouldn't be a problem for users who were unaware
of the feature.

Users of Linux 3.9 or newer can safely pass the
`--defrag` option to `bedup dedup`,
which will defragment files before deduplicating them.

Reporting bugs

Be sure to mention the following:

- Linux kernel version: `uname -rv`
- Python version
- Distribution

And give some of the program output.

btrfs  deduplication  language:python  linux  incremental 
25 days ago by thedward
Linux RAM and Disk Hacking with ZRAM and BTRFS
At a recent job, I faced a pretty bleak situation: my MacBook Pro had only 8 gigabytes of RAM and...
march 2019 by ianweatherhogg
How to Create and Manage Btrfs Snapshots and Rollbacks on Linux (part 2) | The source for Linux information
In "How to Manage Btrfs Storage Pools, Subvolumes And Snapshots on Linux (part 1)" we learned how to create a nice little Btrfs test lab, and how to create a Btrfs storage volume. Now we're going to learn how to make live snapshots whenever we want, and how to roll the filesystem back to any arbitrary point in time. This does not replace backups.
btrfs  linux 
march 2019 by frailty