My name is Philipp C. Heckel and I write about nerdy things.

Minimizing remote storage usage and synchronization time using deduplication and multichunking: Syncany as an example


Cloud Computing, Distributed Systems, Security, Synchronization

Contents


1. Introduction
2. Related Work
3. Deduplication
4. Syncany
5. Implications of the Architecture
6. Experiments
7. Future Research
8. Conclusion
A. List of Configurations
B. Pre-Study Folder Statistics
C. List of Variables Recorded
D. Best Algorithms by Deduplication Ratio
E. Best Algorithms by Duration
F. Best Algorithms by CPU Usage
Bibliography

Download as PDF: This article is a web version of my Master’s thesis. Feel free to download the original PDF version.


Bibliography

[1]
B. Collins-Sussman, B.W. Fitzpatrick, and C.M. Pilato. Version control with subversion. O’Reilly Media, Inc., 2004.
[2]
Canonical, Inc. Bazaar 2.2 Documentation, 2011.
[3]
Canonical, Inc. Bazaar FAQ: Are binary files handled?.
[4]
Dropbox, Inc. Dropbox FAQ: Does Dropbox always upload/download the entire file any time a change is made?.
[5]
ExaGrid Systems. ExaGrid EX Series Product Line.
[6]
Michael Vrable, Stefan Savage, and Geoffrey M. Voelker. Cumulus: Filesystem backup to the cloud. Trans. Storage, 5:14:1–14:28, December 2009.
[7]
Data Domain LLC. Data domain appliance series.
[8]
Antony Adshead. A comparison of data deduplication products, January 2009.
[13]
Eu-Jin Goh, Hovav Shacham, Nagendra Modadugu, and Dan Boneh. Sirius: Securing remote untrusted storage. In Proc. Network and Distributed Systems Security (NDSS) Symposium 2003, pages 131–145, 2003.
[14]
Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, and Mark Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In MASCOTS, pages 1–9. IEEE, 2009.
[16]
Dan Rosenberg. On-disk authenticated data structures for verifying data integrity on outsourced file storage.
[23]
Mahesh Kallahalla, Erik Riedel, Ram Swaminathan, Qian Wang, and Kevin Fu. Plutus: Scalable secure file sharing on untrusted storage, 2003.
[26]
Aameek Singh and Ling Liu. Sharoes: A data sharing platform for outsourced enterprise storage environments. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 993–1002, Washington, DC, USA, 2008. IEEE Computer Society.
[33]
John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS), 2002.
[34]
George Forman, Kave Eshghi, and Stephane Chiocchetti. Finding similar files in large document repositories. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 394–400, New York, NY, USA, 2005. ACM.
[35]
Dave Cannon. Data Deduplication and Tivoli Storage Manager, March 2009.
[36]
A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. ACM SIGOPS Operating Systems Review, 35(5):174–187, 2001.
[37]
Erik Kruus, Cristian Ungureanu, and Cezary Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX conference on File and storage technologies, FAST’10, pages 18–18, Berkeley, CA, USA, 2010. USENIX Association.
[38]
A. Spiridonov, S. Thaker, and S. Patwardhan. Sharing and bandwidth consumption in the low bandwidth file system. Technical report, Citeseer, 2005.
[39]
Kave Eshghi and Hsiu K. Tang. A framework for analyzing and improving content-based chunking algorithms.
[40]
N. Mandagere, P. Zhou, M.A. Smith, and S. Uttamchandani. Demystifying data deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware’08 Conference Companion, pages 12–17. ACM, 2008.
[41]
M.W. Storer, K. Greenan, D.D.E. Long, and E.L. Miller. Secure data deduplication. In Proceedings of the 4th ACM international workshop on Storage security and survivability, pages 1–10. ACM, 2008.
[42]
M. Dutch. Understanding data deduplication ratios. In SNIA Data Management Forum, 2008.
[43]
Dutch T. Meyer and William J. Bolosky. A study of practical deduplication. In Proceedings of the 9th USENIX conference on File and storage technologies, FAST’11, pages 1–1, Berkeley, CA, USA, 2011. USENIX Association.
[44]
A. Tridgell. Efficient algorithms for sorting and synchronization. PhD thesis, The Australian National University, 1999.
[45]
K. Jin and E.L. Miller. Deduplication on Virtual Machine Disk Images. PhD thesis, University of California, Santa Cruz, 2010.
[46]
N. Ramsey, E. Csirmaz, and others. An algebraic approach to file synchronization. ACM SIGSOFT Software Engineering Notes, 26(5):175–185, 2001.
[47]
T. Suel, P. Noel, and D. Trendafilov. Improved file synchronization techniques for maintaining large replicated collections over slow networks. In Data Engineering, 2004. Proceedings. 20th International Conference on, pages 153–164. IEEE, 2004.
[48]
Utku Irmak, Svilen Mihaylov, and Torsten Suel. Improved single-round protocols for remote file synchronization. 2005.
[49]
T. Suel and N. Memon. Algorithms for delta compression and remote file synchronization. Lossless Compression Handbook, 2002.
[50]
Benjamin C. Pierce and Jérôme Vouillon. What’s in Unison? A formal specification and reference implementation of a file synchronizer. Technical report, 2004.
[51]
Kalpana Sagar and Deepak Gupta. Remote file synchronization single-round algorithms. International Journal of Computer Applications, 4(1):32–36, July 2010. Published By Foundation of Computer Science.
[52]
Bryan O’Sullivan. Distributed revision control with Mercurial, 2009.
[53]
Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Comput. Surv., 37:42–81, March 2005.
[54]
Anne-Marie Kermarrec, Antony Rowstron, Marc Shapiro, and Peter Druschel. The icecube approach to the reconciliation of divergent replicas. In Proceedings of the twentieth annual ACM symposium on Principles of distributed computing, PODC ’01, pages 210–218, New York, NY, USA, 2001. ACM.
[55]
Yasushi Saito and Marc Shapiro. Replication: Optimistic approaches. Technical report, 2002.
[56]
Benjamin C. Pierce. Foundations for bidirectional programming, or: How to build a bidirectional programming language, June 2009. Keynote address at International Conference on Model Transformation (ICMT).
[57]
S. Balasubramaniam and B.C. Pierce. What is a file synchronizer? In Proceedings of the 4th annual ACM/IEEE international conference on Mobile computing and networking, pages 98–108. ACM, 1998.
[58]
M. Ajtai, R. Burns, R. Fagin, D.D.E. Long, and L. Stockmeyer. Compactly encoding unstructured inputs with differential compression. Journal of the ACM (JACM), 49(3):318–367, 2002.
[60]
Duke University. NFS & AFS Lecture.
[61]
Data Domain LLC. Deduplication FAQ.
[64]
EMC Corporation. IBM TSM Backup with EMC Data Domain Deduplication Storage, October 2010.
[65]
TechTarget SearchStorage. How to evaluate software-based data deduplication products, November 2007.
[67]
Calicrates Policroniades and Ian Pratt. Alternatives for detecting redundancy in storage systems data. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC ’04, pages 6–6, Berkeley, CA, USA, 2004. USENIX Association.
[68]
Dirk Meister and André Brinkmann. Multi-level comparison of data deduplication in a backup scenario. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR ’09, pages 8:1–8:12, New York, NY, USA, 2009. ACM.
[69]
Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST’08, pages 18:1–18:14, Berkeley, CA, USA, 2008. USENIX Association.
[70]
Linux man page for ‘top’.
[71]
Linux man page for ‘iostat’.
[72]
Alon Orlitsky and Krishnamurthy Viswanathan. Practical protocols for interactive communication. In ISIT2001, volume June 24-29, 2001.
[73]
Hao Yan. Low-latency file synchronization in distributed systems.
[74]
Uzi Vishkin. Communication complexity of document exchange.
[75]
Sun Microsystems. NFS: Network file system protocol specification, 1989.
[76]
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck. Network File System (NFS) version 4 Protocol. RFC 3530 (Proposed Standard), April 2003.
[77]
Ben Collins-Sussman, Brian W. Fitzpatrick, and C. Michael Pilato. Versioning Models, volume r3305. 2008.
[98]
P. Deutsch and J.-L. Gailly. Zlib compressed data format specification version 3.3, 1996.
[99]
M.O. Rabin. Fingerprinting by random polynomials. TR // Center for Research in Computing Technology, Harvard University. Center for Research in Computing Techn., Aiken Computation Laboratory, Univ., 1981. Formerly at http://books.google.de/books?id=5gW1PgAACAAJ (site defunct as of July 2019).

