OStor – data deduplication in the cloud – HowTo

Quick Implementation Note

OStor is implemented in Java, which allows for quick prototyping and platform independence and is a widely used language today. The core idea is based on the Low-bandwidth file system (LBFS) paper. The same principles have been applied to both wide-area acceleration and data deduplication. A future blog post will include detailed implementation notes.
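
To make the LBFS idea concrete, here is a minimal sketch of content-defined chunking plus deduplication. It is not OStor's code: the class name, method names and parameters (WINDOW, MASK, MIN, MAX) are hypothetical, a simple polynomial rolling hash stands in for the Rabin fingerprints used in the LBFS paper, and SHA-1 is assumed as the per-chunk digest. It also uses java.util.Base64, which needs Java 8 or later.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ChunkDedupSketch {
    // Illustrative values only (not OStor's): rolling window, boundary mask,
    // minimum and maximum chunk size in bytes. MIN must be >= WINDOW.
    static final int WINDOW = 16, MASK = 0x3f, MIN = 32, MAX = 1024;

    /** Returns the exclusive end offset of each content-defined chunk in data. */
    static List<Integer> boundaries(byte[] data) {
        final int base = 257;
        int pow = 1; // base^(WINDOW-1), wrapping modulo 2^32
        for (int i = 1; i < WINDOW; i++) pow *= base;
        List<Integer> cuts = new ArrayList<>();
        int start = 0, hash = 0;
        for (int i = 0; i < data.length; i++) {
            int len = i - start + 1; // current chunk length, including byte i
            if (len > WINDOW) hash -= (data[i - WINDOW] & 0xff) * pow; // drop oldest byte
            hash = hash * base + (data[i] & 0xff);                      // add newest byte
            // Cut where the hash of the last WINDOW bytes matches a fixed bit pattern.
            boolean contentCut = len >= MIN && ((hash >>> 8) & MASK) == 0;
            if (contentCut || len >= MAX || i == data.length - 1) {
                cuts.add(i + 1);
                start = i + 1;
                hash = 0;
            }
        }
        return cuts;
    }

    /** Base64-encoded SHA-1 digest of data[from, to). */
    static String sha1Base64(byte[] data, int from, int to) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(data, from, to - from);
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is required on every JVM
        }
    }

    /** Adds data to the store (digest -> reference count); returns how many chunks were new. */
    static int add(Map<String, Integer> store, byte[] data) {
        int newChunks = 0, start = 0;
        for (int end : boundaries(data)) {
            String key = sha1Base64(data, start, end);
            if (store.merge(key, 1, Integer::sum) == 1) newChunks++;
            start = end;
        }
        return newChunks;
    }

    public static void main(String[] args) {
        byte[] original = new byte[20000];
        new Random(42).nextBytes(original);
        // Same content with three bytes inserted in the middle.
        byte[] edited = new byte[original.length + 3];
        System.arraycopy(original, 0, edited, 0, 10000);
        System.arraycopy(original, 10000, edited, 10003, 10000);

        Map<String, Integer> store = new HashMap<>();
        System.out.println("chunks in original: " + boundaries(original).size());
        System.out.println("new chunks, first add: " + add(store, original));
        System.out.println("new chunks, edited copy: " + add(store, edited));
    }
}
```

Because chunk boundaries depend on the bytes themselves rather than on fixed offsets, an insertion typically changes only the chunks near the edit, so the edited copy should add few new chunks to the store.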

Pre-requisites

  • Java 1.6 (I haven’t tested with Java 1.5 or earlier versions)
  • SVN, to check out the code.
  • External libraries – hadoop, log4j and commons-codec. These are included in the SVN repository.

I have installed it out of the box on Mac OS X 10.5 and 10.6, as well as on Kubuntu 7 and later.

HowTo

Check out the code from http://code.google.com/p/ostor/

  • svn checkout http://ostor.googlecode.com/svn/trunk/ ostor

Read the README file – ostor/README

  • make install – then locate the resulting ostor.jar

Run ostor in interactive mode

  • java -cp *:*:jars/* com.ostor.dedup.core.DedupStorCli .dstor
  • type help to see the command syntax
  • add file data/emacs.html – adds a file to the repository
  • show object all – dumps all objects in the repository. Sample output:
    INFO [main] (DedupObjectStor.java:170) – Dump object stor, number of object – 1
    INFO [main] (DedupObject.java:312) – [Object - data/emacs.html] length – 3170551 num segs – 143 unique:: segs – (143/143) size – (100%)
  • show segment all – dumps all segments in the repository. Sample output:
    INFO [main] (DedupSegment.java:212) – Dump segment – Id – SEGMENT-%D6%CBk%9D%EC%E6%88m%1F%1A%AC%E5%1E%23%FC*%81%F6e%60, len – 30720, num refs – 1, hash – 1strnezmiG0fGqzlHiP8KoH2ZWA=
    INFO [main] (DedupSegment.java:212) – Dump segment – Id – SEGMENT-Mfa%EDD%D7%7B%CF%C9%D0%3B%3D%FCk%1E%85cP%84%11, len – 30720, num refs – 1, hash – TWZh7UTXe8/J0Ds9/GsehWNQhBE=
    INFO [main] (DedupSegment.java:212) – Dump segment – Id – SEGMENT-%CB%B4%9F%7E%AA%9E%96%0EZ%EB8%B7%C0%D02%5DO%87%E8%95, len – 30720, num refs – 1, hash – y7Sffqqelg5a6zi3wNAyXU+H6JU=
    … and so on
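
The segment Id and the hash in the output above appear to be two encodings of the same bytes. Decoding the Base64 hash gives 20 bytes, which is consistent with a SHA-1 digest, and URL-encoding those bytes reproduces the text after "SEGMENT-". This is inferred from the sample output only, not from the OStor source, so treat the sketch below as an assumption. The class and method names are hypothetical, and it uses the JDK's own Base64 and URLEncoder (Java 8 or later) rather than commons-codec.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class SegmentIdSketch {
    /** Rebuilds a "SEGMENT-..." id from the Base64 hash printed in the log (assumed relationship). */
    static String segmentIdFromHash(String base64Hash) {
        byte[] digest = Base64.getDecoder().decode(base64Hash);
        // ISO-8859-1 maps every byte to exactly one char, so URL-encoding works byte by byte.
        String raw = new String(digest, StandardCharsets.ISO_8859_1);
        try {
            return "SEGMENT-" + URLEncoder.encode(raw, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
    }

    public static void main(String[] args) {
        String hash = "TWZh7UTXe8/J0Ds9/GsehWNQhBE="; // copied from the sample output
        System.out.println(Base64.getDecoder().decode(hash).length + " bytes"); // prints "20 bytes"
        System.out.println(segmentIdFromHash(hash));
    }
}
```

If the assumption holds, the second line printed is the Id shown next to that hash in the sample output, which makes it easy to cross-check a segment dump by hand.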

In the next blog post, I will describe how to run OStor in standalone mode and then in Hadoop mode.

Responses to “OStor – data deduplication in the cloud – HowTo”

  1. Balanagireddy Mudiam Says:

    Hi,

    I am a graduate student studying computer science at Stony Brook University. I am taking a cloud computing course and selected data deduplication in the Linux kernel as my course project. I find your project interesting. Can you give a high-level overview of your project?

    Thank you.

    Regards
    Bala Mudiam

    • ppraveen Says:

      Hi Bala,

      The goal of the project was to do an open source implementation of deduplication in the cloud. Anyone who stores data on a public cloud (Amazon, etc.) and would like that data to be backed up or archived can take advantage of deduplication.

      I am not sure this project will be useful to you, given that it is written in Java and your project targets the Linux kernel.

      Thanks,
      Praveen.
