Storage Cat 08 3/10/08 17:07 Page 38
38 ADVERTORIAL
THE ULTIMATE
DEDUPLICATION FAQ
Words: By Philip Turner, Regional Director, UK & Ireland, Data Domain
1. Why deduplicate data?
Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has become cheaper over time, enterprises typically store many versions of the same information so that new work can re-use old work. Some operations, like backup, store extremely redundant information. Deduplication lowers storage costs since fewer disks are needed, and shortens backup/recovery times since there can be far less data to transfer. In the context of backup and other nearline data, we can make a strong supposition that there is a great deal of duplicate data. The same data keeps getting stored over and over again, consuming a lot of unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives), and bandwidth (for replication), creating a chain of cost and resource inefficiencies within the organisation.

2. How does data deduplication work?
Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference is created to it. If the segment is unique, it is stored on disk.

3. Is SIS (Single Instance Store) a form of deduplication?
Reducing duplicate file copies is a limited form of deduplication sometimes called single instance storage or SIS. This file-level deduplication is intended to eliminate redundant (duplicate) files on a storage system by saving only a single instance of data or a file.
If you change the title of a 2 MB Microsoft Word document, SIS would retain the first copy of the Word document and store the entire copy of the modified document. Any change to a file requires the entire changed file be stored. Frequently changed files would not benefit from SIS. Data deduplication, which reduces sub-file level data, would recognise that only the title had changed and in effect store only the new title, with pointers to the rest of the document's content segments.

4. What data deduplication rates are expected?
First, redundancy will vary by application, frequency of version capture and retention policy. Significant variables include the rate of data change (fewer changes mean more data to deduplicate), the frequency of backups (more full backups make the compression effect higher), the retention period (longer retention means more data to compare against), and the size of the data set (more data, more to deduplicate).
The deduplication technology approach and granularity of the deduplication process will also affect compression rates. Data reduction techniques typically split each file into segments or chunks; the segment size varies from vendor to vendor. If the segment size is very large, then fewer segment matches will occur, resulting in smaller storage savings (lower compression rates). If the segment size is very small, the ability to find more redundancy in the data increases.
Vendors also differ on how to split up the data. Some vendors split data into fixed-length segments, while others use variable-length segments.

5. What’s the difference between fixed and variable length segments?
• Fixed-length segments (also called blocks). The main limitation of this approach is that when the data in a file is shifted, for example when adding a slide to a PowerPoint deck, all subsequent blocks in the file will be rewritten and are likely to be considered as different from those in the original file, so the compression effect is less significant. Smaller blocks will get better deduplication than large ones, but it will take more processing to deduplicate.
• Variable-length segments. A more advanced approach is to anchor variable-length segments based on their interior data patterns. This solves the data shifting problem of the fixed-size block approach.

6. What is the difference between inline and post-process deduplication?
Inline deduplication means the data is deduplicated before it is written to disk. Post-process deduplication analyses and reduces data after it has been stored to disk.
Inline deduplication is the most efficient and economic method of deduplication. Inline deduplication significantly reduces the raw disk capacity needed in the system since the full, not-yet-deduplicated data set is never written to disk. If replication is supported as part of the inline deduplication process, inline also optimises time-to-DR (disaster recovery) far beyond all other methods as the system does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.
Post-process deduplication technologies wait for the data to