This post introduces my Master’s thesis “Minimizing remote storage usage and synchronization time using deduplication and multichunking: Syncany as an example”. I submitted the thesis in January 2012 and have now found a little time to post it here.
The key goal of this thesis was to determine the suitability of deduplication for end-user applications — particularly for my synchronization application Syncany. As part of this work, the thesis introduces Syncany, a file synchronizer designed with security and provider independence as a core part of its architecture.
Download as PDF: This article is a web version of my Master’s thesis. Feel free to download the original PDF version.
The recently emerged cloud computing paradigm has significantly influenced the way people think about IT resources. Being able to rent computing time or storage from large providers such as Amazon or Rackspace has enabled a more flexible and cost-effective use of resources. While corporations mostly benefit from the significant cost savings of the pay-per-use model, private individuals tend to use it as cheap external storage for backups or to share their private photo collections. Cloud-based backup and file sharing services such as Box.net, Dropbox or Jungle Disk have become very popular and easy-to-use tools for exactly this purpose. They deliver a very simple and intuitive interface on top of the remote storage provided by storage service providers. Key functionalities typically include file and folder synchronization between different computers, sharing folders with other users, file versioning, and automated backups.
While their feature set and simplicity are very appealing to users, these services all share the property of being completely dependent on the service provider, especially in terms of availability and confidentiality. This highlights the importance of the providers and outlines their central role in this scenario: Not only do they have absolute control over the users’ data, they also have to make sure unauthorized users cannot access it. While many users have been willing to trade this lack of confidentiality for a more convenient user experience, this level of trust towards the provider disappears when the data is very private or business critical.
To address the issues outlined above, this thesis introduces Syncany, a file synchronization software that allows users to back up and share files and folders among different workstations. Unlike other synchronization software, Syncany is designed with security and provider independence as an essential part of its architecture. It combines high security standards with the functionality and ease of use of a cloud-based file synchronization application. In particular, Syncany does not allow access to data by any third party and thereby leaves complete control to the owner of the data. The goal of the architecture is to provide the same functionality as existing solutions, but remove provider dependencies and security concerns.
In contrast to other synchronization software, Syncany offers the following additional functionalities:
- Client-side encryption: Syncany encrypts files locally, so that even untrusted online storage can be used to store sensitive data.
- Storage abstraction: Syncany uses a plug-in based storage system. It can be used with any type of remote storage, e.g. FTP, Amazon S3 or WebDAV.
- Versioning: Syncany stores the entire history of files and folders. Old revisions of a file can be reconstructed from the remote storage.
Even though none of these features is a new idea on its own, their combination is a rather new approach and, to the best of the author’s knowledge, does not exist in other file synchronization applications.
1.1. Problem Description
While Syncany is very similar to other file synchronizers from an end-user’s point of view, its functionalities demand a more complex architecture. Especially with regard to the storage abstraction and versioning requirements, traditional synchronization software or frameworks cannot be used. Synchronizers such as rsync or Unison, for instance, completely lack versioning functionality. Version control systems such as Git or Subversion, on the other hand, only support a few storage types and entirely lack encryption.
A rather naive approach to create a file synchronizer with the desired functionalities is to simply encrypt all files locally and then transfer them to the remote storage using the respective transfer protocol. Once a file is changed locally, it is copied using the same protocol. To ensure that the old version of the file is still available, new revisions are copied to a new filename.
While this approach is very simple and can be easily implemented using any kind of storage, it has many obvious flaws. Among others, it always copies files as a whole even if only a few bytes have changed. Because of that, the amount of data stored on the remote storage increases very quickly. Consequently, uploading updates and reconstructing files from the remote storage takes significantly longer than if only changes were transferred.
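To make the flaw concrete, the naive scheme can be sketched as follows. This is a minimal Python illustration, not Syncany’s actual code: the encryption step is omitted, and the `.vN` revision-naming scheme is a placeholder of my own choosing.

```python
import os
import shutil

def upload_new_revision(local_path, remote_dir, revision):
    """Naive versioning: encrypt locally (omitted here), then copy the
    WHOLE file to the remote storage under a new revision name, even if
    only a few bytes have changed since the last revision."""
    remote_name = f"{os.path.basename(local_path)}.v{revision}"
    shutil.copy(local_path, os.path.join(remote_dir, remote_name))
    return remote_name
```

With this scheme, a one-byte change to a 1 MB file still costs 1 MB of upload bandwidth and 1 MB of additional remote storage, which is exactly the behavior the deduplication-based design avoids.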
Syncany overcomes this problem with its own architecture: Storage abstraction, versioning and encryption are part of the core system. Using deduplication technology, Syncany delivers the same features as other synchronizers, and minimizes its storage requirements at the same time. Especially in terms of storage reduction, deduplication has proven to be very effective in various hardware and software based backup systems. In end-user applications, however, the technology is rarely used to its full extent.
One reason for this is that the algorithms used in deduplication systems are very CPU intensive: the underlying data is analyzed very thoroughly, and frequent index lookups increase the CPU usage even further. In most cases, only very coarse-grained deduplication mechanisms are used on the client, and further data-reducing algorithms are performed on the server.
Another issue arises when combining deduplication with the storage abstraction concept. Because deduplication algorithms generally create a large number of small files, the upload and download time tends to be very high. For Syncany, this means a high synchronization time between participating clients.
Overall, using the concept of deduplication as part of Syncany’s core is both the solution and the problem. On the one hand, it is one of the key technologies to create a functioning file synchronizer. Among others, it enables versioning, minimizes disk usage on the remote storage and reduces the amount of transferred data. On the other hand, the technology itself causes problems on the client. Examples include high CPU usage and an increased amount of upload and download requests.
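The multichunk idea discussed later in the thesis can be sketched as follows: instead of uploading each small chunk individually, chunks are packed into larger containers, so fewer transfer requests are needed. This is a simplified Python illustration; the minimum-size threshold is an arbitrary example value, and a real implementation would also compress and encrypt each multichunk.

```python
def make_multichunks(chunks, min_size):
    """Group many small chunks into fewer, larger 'multichunks' so
    that fewer upload/download requests hit the remote storage."""
    multichunks, current, size = [], [], 0
    for chunk in chunks:
        current.append(chunk)
        size += len(chunk)
        if size >= min_size:              # container full: close it
            multichunks.append(b"".join(current))
            current, size = [], 0
    if current:                           # flush the remainder
        multichunks.append(b"".join(current))
    return multichunks
```

The trade-off is that downloading a single chunk now requires fetching the whole multichunk that contains it, which is why choosing the multichunk size is part of the experiments.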
Of the variety of issues in the overall architecture, this thesis focuses on a select few. In particular, the goal is to solve the problems created by introducing deduplication as a core concept. In the scope of this thesis, that means demonstrating that deduplication is a valid technology for an end-user application and tackling the issues it raises. The overall goal is to find the deduplication algorithm best suited for Syncany’s environment as well as to make sure that synchronization time between clients is minimal.
Broken down into individual parts, the thesis aims to fulfill the following goals:
- Find a suitable deduplication algorithm for Syncany: Since deduplication is mostly used on dedicated server machines, the requirements for client machines differ from the usual environment. An algorithm is suitable for Syncany if it minimizes the amount of data that needs to be stored, while at the same time ensuring that client resources are not used excessively and synchronization time remains adequate.
- Minimize the synchronization time between Syncany clients: The synchronization time is the sum of all subsequent processes performed to synchronize two or more Syncany clients. Minimizing it means accelerating the deduplication algorithm on the one hand, but also reducing the transfer time to and from the remote storage.
In order to fulfill these goals, this thesis analyzes the implications of the Syncany architecture on potential algorithms and performs experiments to find the optimal deduplication algorithm. It furthermore introduces the multichunk concept, a technique to reduce the upload and download time to and from the remote storage, and validates its efficiency in experiments.
After this brief introduction, chapter 2 introduces related research and technologies. It particularly talks about research that influenced the development of Syncany. Among others, it discusses file synchronizers such as rsync and Unison in the context of the remote file synchronization problem. It then briefly presents the concepts of version control systems and finally explains the basic idea of distributed file systems.
Chapter 3 presents the fundamentals of deduplication. It gives a short overview of the basic concept and distinguishes deduplication from other capacity reducing technologies such as compression. The chapter then introduces relevant hardware and software based deduplication systems and briefly elaborates on their respective properties. Subsequently, it explains the different types of chunking methods — namely whole file chunking, fixed-size chunking, variable-size chunking and file type aware chunking. Finally, the chapter introduces and explains different quality measures for comparing deduplication mechanisms against each other.
After focusing on the technologies used, chapter 4 presents Syncany and explains how deduplication is used in it. It furthermore explains the motivation behind the software, defines high level requirements and elaborates on design choices made in the development. It then briefly introduces Syncany’s architecture and the concepts used in the software.
Chapter 5 illustrates the implications of the architecture on the thesis’ goals. Among others, it elaborates on bandwidth and latency issues and introduces the multichunk concept. It then discusses the design of potential deduplication algorithms and chooses the algorithm parameters for the experiments.
Chapter 6 uses these parameters to perform three experiments on different datasets. Experiment 1 aims to find the best parameter configurations with regard to chunking efficiency. Experiment 2 tries to maximize the upload bandwidth and experiment 3 measures the reconstruction time and size.
Before concluding the thesis, chapter 7 briefly discusses possible future research topics.
I'd very much like to hear what you think of this post. Feel free to leave a comment. I usually respond within a day or two, sometimes even faster. I will not share or publish your e-mail address anywhere.