January 20, 2009
Data deduplication goes a long way toward reducing data storage costs by making storage much more efficient, which in turn can reduce the overall footprint inside the data center. Knowledge Center contributor Chris Poelker explains data deduplication's benefits, including how leveraging data deduplication can help green your data center.
What is data deduplication? What are its benefits? In simplified terms, data deduplication means comparing objects (usually files or blocks) and removing all non-unique objects (that is, copies). The basic benefits of data deduplication can be summarized as follows: reduced hardware costs, reduced data center footprint, reduced backup costs, reduced costs for disaster recovery, and more efficient use of storage.
If you look at the left side of the figure below, you will see several blocks being stored that are not unique. The data deduplication process removes any blocks that are not unique, resulting in the smaller group of blocks to the right.

You can apply data deduplication in multiple places. Wherever you apply it, data deduplication can affect costs not only for your Storage Area Network (SAN), but also for your entire IT infrastructure.
Based on an enterprise environment running typical applications, you probably could squeeze out between 10 and 20 percent more storage space just by getting rid of duplicate and unnecessary files. Files are commonly known as "unstructured data" and the data residing in databases is commonly known as "structured data." Simple unstructured data in files can therefore be deduplicated at the file system level, but the structured data residing in large databases is typically deduplicated underneath the actual operating system's file system at the block level.
Interestingly though, since block-level deduplication does not need to understand the file system, it is sometimes even more efficient to deduplicate files at the block level. Whether you choose a solution that works at the block level, file level or both, you will find that it can pay for itself extremely fast in the amount of savings you get from storage, media, power, cooling and floor space costs.
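To see why block-level deduplication can beat file-level deduplication on files, here is a minimal Python sketch. The 4 KB block size, the two sample "files" and the function names are assumptions made for illustration; they do not describe any particular product.

```python
import hashlib

BLOCK = 4096  # assumed fixed block size; real products vary


def unique_bytes_file_level(files):
    """Bytes kept when only whole-file duplicates are removed."""
    seen = {}
    for content in files:
        seen.setdefault(hashlib.sha256(content).hexdigest(), len(content))
    return sum(seen.values())


def unique_bytes_block_level(files, block=BLOCK):
    """Bytes kept when duplicate blocks are removed across all files."""
    seen = {}
    for content in files:
        for i in range(0, len(content), block):
            chunk = content[i:i + block]
            seen.setdefault(hashlib.sha256(chunk).hexdigest(), len(chunk))
    return sum(seen.values())


# Two hypothetical 10-block files that differ only in their last block.
base = b"".join(bytes([n]) * BLOCK for n in range(10))
edited = base[:-BLOCK] + b"\xff" * BLOCK
files = [base, edited]

file_level = unique_bytes_file_level(files)    # both files kept in full
block_level = unique_bytes_block_level(files)  # 10 shared blocks + 1 new block
print(file_level, block_level)
```

Because the two files are not identical, file-level comparison keeps both in full, while block-level comparison keeps the ten original blocks plus the one block that changed.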
At a high level, block-level deduplication follows four steps:
1. Divide the input data into blocks or chunks
2. Calculate a hash value for each block
3. Use the hash value to determine whether another block with the same data has already been stored
4. If it has, replace the duplicate data with a reference to the object already in the database
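The steps above can be sketched in a few lines of Python. This is a simplified illustration, not how any commercial product is implemented: the fixed 4 KB block size, the use of SHA-256, the in-memory dictionary standing in for the database, and the names `deduplicate` and `restore` are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def deduplicate(data, store, block_size=BLOCK_SIZE):
    """Store unseen blocks in `store` and return the list of hash references."""
    recipe = []
    for offset in range(0, len(data), block_size):   # step 1: split into blocks
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()   # step 2: hash each block
        if digest not in store:                      # step 3: already stored?
            store[digest] = block
        recipe.append(digest)                        # step 4: keep a reference
    return recipe


def restore(recipe, store):
    """Rebuild the original data from its list of hash references."""
    return b"".join(store[digest] for digest in recipe)


store = {}
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
recipe = deduplicate(data, store)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

The sample input contains four blocks but only two distinct ones, so only two blocks are physically stored, and `restore(recipe, store)` returns the original bytes.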
You can implement the actual process of data deduplication in several ways. For example, you can eliminate duplicate data simply by comparing two files and deleting the one that's older or no longer needed, or you can use a commercial deduplication product. Commercial solutions use sophisticated methods and the actual math involved can make your head spin. If you want to understand all the nuances of the mathematical techniques used to find duplicate data, you should take college courses in statistical analysis, data security and cryptography (and hey, who knows: if your current line of work doesn't pan out for you, maybe you could get a job at the CIA).
Most of the data deduplication solutions on the market today use standard cryptographic hashing techniques to create an effectively unique mathematical representation of the chunk of data in question (a hash) so that the hash can be compared with the hashes of any new data to determine whether that data is unique. The hash also serves as the metadata (that is, the data about other data) for the chunk of data in question. A hash used as metadata serves as an efficient index in a lookup table, allowing you to quickly determine whether any new data being stored is already present and can be eliminated.
Why data deduplication is important
Data deduplication goes a long way toward reducing data storage costs by making storage much more efficient, which in turn can reduce the overall footprint inside the data center. Just think: if by deduplicating your data you can store the exact same amount of information in less than one-tenth the footprint, imagine how much money and energy you could save in power and cooling costs.

Why tape is not so green
Some of the folks who sell tape will tell you that, since tape does not require power after it's used, it's greener to use tape than disk, even if the data is deduplicated. They would be right. Tape takes up no power at rest. But some of those older, massive tape libraries need a nuclear power plant to operate. Disks draw a lot of power when they are spinning up, but draw much less during normal operation.
The other not-so-green fact about tape is that you end up with a lot of it over time. If your Disaster Recovery (DR) strategy is to ship tapes offsite for recovery or storage, those tapes are using a heck of a lot of gasoline that disk drives don't need. In fact, a virtual tape library (VTL) that implements deduplication can electronically replicate the data to another VTL at a different location, which would also green the other data center. Also, the most prevalent VTL solution can encrypt the replicated virtual tapes so there is no risk of losing or misplacing sensitive data.
How backup environments benefit
Let's look at a typical backup environment as an example, since that is the area that benefits greatly from data deduplication. Data deduplication solutions can be implemented in many places but data backup and data archiving are the areas where benefits are immediately apparent. The more data you have, and the longer you need to retain it for business reasons or regulatory purposes, the better results you see from your data deduplication solution.
The figure below shows a sample dataset of 20 TB being retained over five weeks, with typical data growth and change rates. If you use a traditional backup solution (such as Veritas NetBackup, CommVault, IBM Tivoli Storage Manager (TSM), EMC Legato or HP Data Protector) to back up the data to media (disk or tape) with no deduplication, you'll need to store more than 101 TB of data in only five weeks. [Okay, for you IBMers out there, TSM is a progressive backup solution, so you will probably store less on tape but don't get me started on all the disk-based file systems being used for the D2D (disk to disk) part of the backup!]
In the figure below, you can see that after five weeks with no deduplication going on, you will have stored about 110 TB of data.

Now let's take the same metrics and apply a deduplication ratio of a little over 6-to-1. Instead of storing 110 TB, we now only need to store a little more than 24 TB for the exact same amount of information.
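To show how totals of this kind come about, here is a rough Python sketch of weekly full backups with and without deduplication. The 20 TB starting size and the five-week retention come from the example above; the 5 percent weekly growth rate, the decision to ignore changed blocks, and the function name `backup_totals` are assumptions made for illustration, since the figure itself is not reproduced here.

```python
def backup_totals(initial_tb, weekly_growth, weeks):
    """Return (traditional_tb, deduplicated_tb) for weekly full backups.

    Traditional: every weekly full backup is stored in its entirety.
    Deduplicated: the first full is stored once, then only the data that
    is new each week (changed blocks are ignored for simplicity).
    """
    size = initial_tb
    previous = 0.0
    traditional = 0.0
    deduplicated = 0.0
    for _ in range(weeks):
        traditional += size               # a complete copy every week
        deduplicated += size - previous   # only data not seen before
        previous = size
        size *= 1 + weekly_growth
    return traditional, deduplicated


traditional, deduplicated = backup_totals(20.0, 0.05, 5)
print(f"traditional: {traditional:.1f} TB, deduplicated: {deduplicated:.1f} TB")
```

Under these assumptions the two totals land close to the numbers quoted above, but real results depend on the change rate, the retention period and the product in use.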
All things being equal, we can see that data deduplication can offer dramatic savings in data center floor space, tape media costs, and tape storage and shipping costs. And, if it is used in conjunction with disks as a backup methodology, it also offers much faster recovery if something goes wrong.
The green aspects of data deduplication even extend outside the data center to the trucks that are no longer required to ship bulky tapes offsite. I haven't even mentioned yet how data deduplication can improve disaster recovery. Less WAN bandwidth needed to replicate data is a major benefit. Another benefit is that, if you send less, you store less on the other side, which relates to the cost of storage, power and cooling of the DR location. So you can see, the value and the benefits can add up real fast, and that relates to a greener world for you in more ways than one.
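The WAN saving can be illustrated with a small Python sketch of deduplicated replication, in which only blocks the DR site does not already hold are sent. The dictionary standing in for the remote store, the 4 KB block size, the sample data and the `replicate` function are hypothetical; the sketch also ignores the small cost of exchanging the hashes themselves.

```python
import hashlib

BLOCK = 4096  # assumed fixed block size


def replicate(data, remote_store, block=BLOCK):
    """Send only blocks the remote site lacks; return the bytes sent."""
    sent = 0
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in remote_store:   # remote site has never seen this block
            remote_store[digest] = chunk
            sent += len(chunk)
    return sent


remote = {}
monday = b"".join(bytes([n]) * BLOCK for n in range(8))
tuesday = monday[:-BLOCK] + b"\xee" * BLOCK   # one block changed overnight

first = replicate(monday, remote)    # everything is new to the DR site
second = replicate(tuesday, remote)  # only the changed block travels
print(first, second)
```

The first replication sends all eight blocks; the second sends only the single block that changed, and the DR site stores nine blocks in total instead of sixteen.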