Gentoo Forums
any lessfs users out there?
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sat Apr 03, 2010 7:10 pm    Post subject: Reply with quote

OK, did another test: I stored the pigz-compressed tar on lessfs instead of the direct uncompressed tar. The time taken was 3m38s and the space used was 5.96 GiB for a 5.86 GiB file. Again, no dupes! I don't believe this.
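
For reference, roughly what that test looked like (a sketch only; the lessfs mount point /mybackup and the file names are just examples, not the exact command line I used):

Code:
# pipe an uncompressed tar through pigz and store the result on the lessfs mount
time tar cpf - / --one-file-system | pigz > /mybackup/myroot.tar.gz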
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sat Apr 03, 2010 8:44 pm    Post subject: Reply with quote

OK, that was dumb! The gains from lessfs will be seen the next time I do a full backup of the system. The gzip-based tar will remain the same size or grow, but the lessfs-based backup will effectively be differential.

So, to test this conjecture, I backed up the same folder twice. I can see that the second backup eats up less than 10 MB of additional space.

It will be interesting to compare this against the file-based differential backup provided by tar and dar. I use dar and find its differential mode a great feature, but it is file-based, not block-based, so a single changed block can cause a big file to be stored again, whereas in a block-based scheme the whole file won't get saved. I think this is why people are reporting success with vmware images (which are large files that change only partially at the block level).
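
For comparison, the dar differential mode I mean works roughly like this (a sketch; paths and archive basenames are just examples):

Code:
# full backup of / into the archive basename "full" (dar adds the .1.dar slice suffix)
dar -c /backups/full -R /
# differential backup: store only what changed relative to the "full" archive
dar -c /backups/diff1 -R / -A /backups/full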
platojones
Veteran


Joined: 23 Oct 2002
Posts: 1595
Location: Just over the horizon

PostPosted: Sat Apr 03, 2010 8:59 pm    Post subject: Reply with quote

This is intriguing. Very nice results for compressed image backups. I use rdiff-backup for my backups, so it's probably not going to do much for me there (since I'm just saving diff snapshots). However, I also run vmware with a virtual disk... that is where it could pay off, especially with snapshots.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 2:45 am    Post subject: Reply with quote

You finally arrived at the right answer, devsk, that the savings come the second time around.

I am using rsync to back up all the computers on a network. I keep a current_backup directory on the backup server for each computer. After the nightly rsync for that computer is finished, I create a second directory using "cp -a ..." to make a hard-linked copy of the entire filesystem of each machine, so the paths look like /bu/snapshots/<machine>/YYYY/MM/DD/hh.mm.ss/<filesystem_directory_tree>. This amounts to file-level dedup: files that did not change only get a hard link created, while files that did change get the old copy deleted from that machine's current_backup directory and the new copy added to it, all before the hard linking takes place. This lets me keep a complete directory tree for each machine for each night, but underneath it is all smoke, mirrors, and hard links. 8)
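
In rough outline, the nightly run for one machine looks something like this (a sketch only, not my actual script; the host name and paths are examples, and -l is what makes cp create hard links rather than copies):

Code:
MACHINE=somehost
NOW=$(date +%Y/%m/%d/%H.%M.%S)
# 1. refresh the current_backup tree; changed files replace the old copies
rsync -a --delete root@$MACHINE:/ /bu/current_backup/$MACHINE/
# 2. snapshot it with hard links; unchanged files cost only a directory entry
mkdir -p /bu/snapshots/$MACHINE/$NOW
cp -al /bu/current_backup/$MACHINE/. /bu/snapshots/$MACHINE/$NOW/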

By using a block-level dedup system like lessfs, I would gain several advantages. First, I could share commonality between machines for common files, such as operating system stuff. Second, log files, which grow every night, could share their common older sections. Third, I would not need the hard link operation anymore, saving some time, which would offset the slower operation you seem to be getting from lessfs.

If you want to increase commonality, do it with SMALLER blocks, not bigger ones. Think about it: if each block were only one bit long, you would have a tremendous amount of sharing, whereas if each block were a petabyte, it would be a miracle if you got any sharing at all.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 2:53 am    Post subject: Reply with quote

The advantage with vmware is that you can have many machines cloned off a single base machine, or snapshots of a machine forming a chain when you are debugging, etc. If you are using lessfs for the filesystem under the vmx files, then you save a bunch of space, and it is faster, since you do not need to actually copy all that sameness every time you create a new clone or snapshot.

But remember, smaller blocksizes give greater sharing. The sensible thing to do, in my mind, is to make the dedup blocksize the same as the chunk size, cluster size, etc. of the filesystem you are planning to run under lessfs, or of the filesystem you are planning to back up to lessfs.
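
Checking what that size actually is on the filesystem you plan to back up is easy enough; for example (device and mount point are just examples):

Code:
# block size of an ext3/ext4 filesystem
tune2fs -l /dev/sda3 | grep 'Block size'
# or, for any mounted filesystem
stat -f /home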

I'm not sure how lessfs stores compressed blocks, or whether that compression is always going to be desirable. I am surprised that you say there is no way to turn it off. Maybe it's time to fork a branch off the base code... :twisted:
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 3:10 am    Post subject: Reply with quote

Back to backups for a while...

It would be really nice to have both block-level dedup and file-level dedup in the same backup filesystem. The reason is forensics. If your network is compromised, a nightly complete backup of every machine can be a great aid in assessing the damage and determining the cause. A file-level dedup on top of the block-level dedup would only have to store a pointer to the list of pointers to data blocks associated with the hashes. If two files held the same contents, they would point to the same list of data block pointers. By reverse-indexing these top-level pointers, you could get a list of all files that had the same contents, even if they were on different machines, had different names or paths, or any combination of these. This would let you, for instance, see all the machines on your network that had been infected with the same rootkit, bot, worm, or whatever. By looking at the timestamps on these files, you could even tell where the first infection occurred.

Now on to the implementation of deletion of a block in a block level dedup filesystem such as lessfs...

It would seem to me overly burdensome to free up a data block after the last reference to it disappears. You would need to implement some sort of reference counting mechanism where each <data block, hash code, pointer> tuple keeps track of how many references there are to that data block. This counter would need to be incremented when a reference was created, and decremented when a reference was deleted. That's a lot of storage to hold all those counters, and a lot of time to keep them updated. Furthermore, the updating would need to be protected as a critical section to avoid race conditions.

It seems to me that it would make more sense to have a garbage collection operation that scans through all the pointers to hashes and cleans up any that are unused. Another approach would be to just let the unused <hash code, pointer, data block> objects hang around on the long-shot chance that someday they would get reused.

Garbage collection could proceed in a manner similar to the beloved windoze defrag operation. :x

An alternative would be to implement an ephemeral (or background) garbage collector. An examination of Lisp and Java implementations could serve as some inspiration here.

A third approach would be to just do a file level copy of the entire lessfs filesystem contents from one volume to another. All the unreferenced stuff would just get left behind. :)
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sun Apr 04, 2010 3:18 am    Post subject: Reply with quote

One thing I found is that the tar-created file has to be block-aligned with "-b 256", as in "time tar cpf /mybackup/myroot.tar / --one-file-system -b 256", for the second backup to recognize duplicate blocks. Otherwise, a single file change can throw off the whole tar file and none of the blocks will be found to be duplicates.

Throw in compression and I think this becomes a complex problem. I hope lessfs starts providing LZMA2 compression so that I can use that for backup purposes and use LZO for more live data. LZO is faster but it's not space-efficient at all. The multi-threaded 7za is able to compress files on an i7 920 (8 CPU threads) machine in less time than gzip, at almost half the size, and the multi-threaded architecture of lessfs gives me that advantage right away.
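
The kind of comparison I mean, roughly (a sketch; it assumes a p7zip new enough to know LZMA2, and -mmt sets the thread count):

Code:
# single-threaded gzip
time gzip -c /mybackup/myroot.tar > /mybackup/myroot.tar.gz
# multi-threaded 7-Zip with LZMA2 on all 8 threads of the i7 920
time 7za a -m0=lzma2 -mmt=8 /mybackup/myroot.tar.7z /mybackup/myroot.tar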
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sun Apr 04, 2010 3:24 am    Post subject: Reply with quote

They have a defrag command available through the telnet interface (telnet localhost 100). Type quickly, because it times you out fast... :-)
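
The session looks something like this (port 100 as above; "defrag" as the literal command name is my reading of it, so treat this as a sketch):

Code:
# connect to the lessfs maintenance interface and kick off a defrag
telnet localhost 100
defrag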
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 5:09 am    Post subject: Reply with quote

The idea with a dedup system is *NOT* to send it compressed tarballs! :o

The best ordinary compression only gets you an average of 2x to 3x over a typical entire machine's filesystem. With dedup, you just send it the files, one at a time, or use rsync if you want to reduce network traffic (which is my recommendation). Even with my somewhat simplistic approach to file-level dedup, I am getting 20x to 30x over a month's worth of files. You just cannot come close to that with gzip, bzip2, or anything like that, because they cannot share structure over an extended period of time the way a dedup system can. I would expect considerably better compression with a block-level dedup system, for all the reasons I outlined in my posts earlier.

I do not think the extra effort of block-by-block compression is even worth it, but I haven't run any tests. They say they are getting write speeds of up to 350 MB/sec, which almost fills a SATA-2 pipe. The only way I could easily get that speed with raw disks is to RAID-0 a few SSD drives. I am getting 150 MB/sec on a 250 GB SSD in my laptop now. 8)
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sun Apr 04, 2010 5:28 am    Post subject: Reply with quote

Moriah wrote:
The idea with a dedup system is *NOT* to send it compressed tarballs! :o

The best ordinary compression only gets you an average of 2x to 3x over a typical entire machine's filesystem. With dedup, you just send it the files, one at a time, or use rsync if you want to reduce network traffic (which is my recommendation). Even with my somewhat simplistic approach to file-level dedup, I am getting 20x to 30x over a month's worth of files. You just cannot come close to that with gzip, bzip2, or anything like that, because they cannot share structure over an extended period of time the way a dedup system can. I would expect considerably better compression with a block-level dedup system, for all the reasons I outlined in my posts earlier.

I do not think the extra effort of block-by-block compression is even worth it, but I haven't run any tests. They say they are getting write speeds of up to 350 MB/sec, which almost fills a SATA-2 pipe. The only way I could easily get that speed with raw disks is to RAID-0 a few SSD drives. I am getting 150 MB/sec on a 250 GB SSD in my laptop now. 8)
yeah, that's why uncompressed tar works better.

I agree that over a period of time the savings are huge without compression, unless you are already doing a compressed differential backup. But the current LZO compression in lessfs works AFTER dedup, which reduces the size considerably. I think this is a good idea.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 5:33 am    Post subject: Reply with quote

While it reduces the size, there is the price of decompressing every time the block is read. Remember, each block is independently compressed. This is necessary because that one block could appear in a lot of different contexts.

Also, these files need to be able to be directly addressed, not just sequentially read from start to finish, so you need to be able to start the decompression beginning with any block in the file being read.

Given that compression only needs to occur once, and decompression needs to occur every time the block is read, I would look for a compression algorithm with exceptional decompression performance, even if the initial compression was not the fastest among all the options available.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sun Apr 04, 2010 5:41 am    Post subject: Reply with quote

Moriah wrote:
While it reduces the size, there is the price of decompressing every time the block is read. Remember, each block is independently compressed. This is necessary because that one block could appear in a lot of different contexts.

Also, these files need to be able to be directly addressed, not just sequentially read from start to finish, so you need to be able to start the decompression beginning with any block in the file being read.

Given that compression only needs to occur once, and decompression needs to occur every time the block is read, I would look for a compression algorithm with exceptional decompression performance, even if the initial compression was not the fastest among all the options available.
LZO is that algo. It's the fastest at decompression. LZMA is also fast at decompression but requires more memory, I think. That's the reason squashfs is moving to LZMA sometime soon.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Sun Apr 04, 2010 6:07 am    Post subject: Reply with quote

What do you do about Windows partitions like NTFS? How do you back those up? rsync? Restore is a pain because we lose all the security and other FS attributes that Windows loves! I typically end up doing an ntfsclone for them.

But now with dedup, I am hoping my ntfsclone's will be much smaller.

Edit: That was not meant to be. A small change (like copying a few MB of files) in the NTFS volume throws off the blocks dumped by ntfsclone, so no dupes are found and the second copy is brand new... :-(

EDIT: OK, the same trick that I used with 'tar' comes in handy with ntfsclone. Pass the ntfsclone output through 'dd' to align it at 128 KB (the block size of lessfs). Cloning to lessfs, then copying a few MB of files, then cloning again adds only a few MB to the image. YAY!
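
Roughly (a sketch; the device, output file name, and exact dd options are examples, not necessarily what I typed):

Code:
# stream the NTFS volume through dd so it is written in 128 KB chunks,
# matching the lessfs block size
ntfsclone -o - /dev/sda2 | dd of=/mybackup/winxp.img bs=128K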
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Sun Apr 04, 2010 2:45 pm    Post subject: Reply with quote

I just use rsync from a cygwin window...
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Mon Apr 05, 2010 2:11 am    Post subject: Reply with quote

I started a backup of my data using lessfs, and at the speed it's going it will take about 10 hours to copy the 300 GB of multimedia I have using rsync. It's averaging about 8-9 MB/sec. This is with a 64 KB block size. I had expected faster, because the media is very fast, so I think there are bottlenecks in lessfs somewhere. For example, I never see CPU usage above 12%, although I have configured 4 threads (i7 920).

I think lessfs is not useful with rsync at all. It's too slow.

EDIT: the backup is still going on and the rate has fallen to 6.5 MB/sec. This crap doesn't scale. It's slowing down as the DB grows.
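
For the record, the copy is just a plain rsync onto the lessfs mount, watched with --progress/--stats (paths are examples):

Code:
rsync -a --progress --stats /data/multimedia/ /mnt/lessfs/multimedia/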
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Tue Apr 06, 2010 6:11 am    Post subject: Reply with quote

The backup finished after a marathon 1248 minutes (i.e. almost 21 hours). It backed up ~610 GB, and the DB size is ~540 GB (saving me ~70 GB, nearly 11%, through compression and duplicate blocks). So the overall speed was about 8 MB/sec, which is pretty low in my opinion.

I will only know its effectiveness when I create a full backup next week. I will time that and post here.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Tue Apr 06, 2010 1:50 pm    Post subject: Reply with quote

Is next week's backup to be a *SECOND* backup of what you just backed up this week? If so, I expect it to go faster, and use much less storage.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Tue Apr 06, 2010 1:52 pm    Post subject: Reply with quote

Moriah wrote:
Is next week's backup to be a *SECOND* backup of what you just backed up this week? If so, I expect it to go faster, and use much less storage.
Yes, primarily. Whatever I backed up, plus whatever changes between now and next week (which is not much).
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Tue Apr 06, 2010 2:03 pm    Post subject: Reply with quote

That's what I meant. Remember, the first backup has to move everything across, and it has to create a data block for nearly every block read. All subsequent backups send much less across, thanks to rsync, and nearly every block is a duplicate, which means no new data block creation, just a hash computation, lookup, and indexing. That means fewer disk accesses on the target, and fewer seeks, which are the real time-eaters. You should get a much better time the second and subsequent times you back up, I would expect.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Tue Apr 06, 2010 2:07 pm    Post subject: Reply with quote

Another place where a backup server gains an advantage with dedup is in sharing commonality between different machines. My current file-level dedup scheme only shares commonality within a single machine; if another machine has the same file, it makes another copy. With lessfs, that identical file gets shared between all the machines that have a copy of it, even though it lives on another machine. Now think of all the files that make up the OS and libraries... :)
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Tue Apr 06, 2010 2:38 pm    Post subject: Reply with quote

Have you read the lessfs forum? It is at http://www.lessfs.com/wordpress/?p=165
A post from April 1, 2010 says:
Quote:

Jay says:
April 1, 2010 at 4:06 pm

Thanks a lot for your great job!

Currently i’m running LESSFS 1.0.8, compiled from source, on my Gentoo 32-Bit Notebook (2.4 GHz Intel Core2Duo with 4 GB RAM, 128 GB SSD).

Thanks to Data Deduplicating w/ LESSFS the Disk Amount of my 4 KVM Ubuntu Instances (Testfarm for GlusterFS) went down from approx. 9 GB to 1.8 GB:

brunk@jay ~/vmware/gluster $ ls -lhs
total 8.9G
512 drwxr-xr-x 2 root root 4.0K Apr 1 15:47 lost+found
2.6G -rw------- 1 brunk users 2.6G Apr 1 16:55 server-1.qcow2
641M -rw------- 1 brunk users 641M Apr 1 16:30 server-1.qcow2.new
2.4G -rw------- 1 brunk users 2.4G Apr 1 16:55 server-2.qcow2
1.7G -rw------- 1 brunk users 1.7G Apr 1 16:55 server-3.qcow2
1.7G -rw------- 1 brunk users 1.7G Apr 1 16:55 server-4.qcow2
128K -rwx------ 1 brunk users 757 Apr 1 16:34 start-server.sh

jay ~ # du -sch /data/*
1.7G /data/dta
60M /data/mta
1.8G total

Best Regards, Jay

This amounts to a situation similar to 4 backups, since it is 4 similar VMs.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Tue Apr 06, 2010 3:18 pm    Post subject: Reply with quote

VMs need fast I/O (I/O is more of a bottleneck than CPU for a VM). Unless I am just testing some temporary VMs, I would never store a VM inside lessfs.
Moriah
Advocate


Joined: 27 Mar 2004
Posts: 2117
Location: Kentucky

PostPosted: Tue Apr 06, 2010 3:41 pm    Post subject: Reply with quote

But can't a VM relinquish the CPU while it is waiting for I/O? That allows another task/thread to execute instead, so the individual VM might not be as fast, but the overall utilization of the host CPU would be greater, and its overall throughput would be higher.
_________________
The MyWord KJV Bible tool is at http://www.elilabs.com/~myword

Foghorn Leghorn is a Warner Bros. cartoon character.
yther
Apprentice


Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

PostPosted: Thu Apr 08, 2010 1:25 am    Post subject: Reply with quote

Quite an interesting thread here! I'm thinking of using something like this for a NAS where I work (probably in a year or two, though, as we just bought hardware and I don't feel like selling my boss on a home-grown NAS at the moment), since I have seen the benefits of single-instance storage of PC backups on our WHS-based appliance. In my job I have to store lots of VHDs, all various versions of Windows, so across the entire set of nearly 100 VHDs there is going to be quite a lot of redundant content. The WHS appliance, however, only applies its magical block-level dedupe to the backups, not to any other file sets stored on it, so I am still resorting to 7z to cut down space there. With a system like lessfs, ZFS, and others, I see the potential for huge space savings without the man-hours required to compress everything.

It's nice to have something like lessfs available for testing, before actually trying to do this. :)
devsk
Advocate


Joined: 24 Oct 2003
Posts: 2870
Location: Bay Area, CA

PostPosted: Mon Apr 12, 2010 5:43 am    Post subject: Reply with quote

I have started playing around with zfs-fuse and its dedup branch. I am currently using gzip compression in ZFS. I will post a comparison some time later. It looks like zfs-fuse has performance advantages compared to lessfs.
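
For the curious, the setup I'm trying is along these lines (a sketch; pool and device names are examples, and dedup=on needs the dedup branch mentioned above):

Code:
# create a pool on a spare disk, then enable gzip compression and dedup
zpool create backuppool /dev/sdb
zfs set compression=gzip backuppool
zfs set dedup=on backuppool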
Page 2 of 3