Comparative analysis of data compression and deduplication
Data compression and deduplication both aim to reduce the amount of data stored or transmitted, but they start from different premises. Data compression rests on the redundancy inherent in how information is encoded, and is grounded in information theory; deduplication relies on the repetition of data blocks and is a practical engineering technique. At heart, however, the two do the same thing: they locate redundant data and replace it with shorter pointers. The key differences lie in the scope over which redundancy is eliminated, the method used to find it, the granularity of the redundancy, and the specifics of the implementation.
(1) Scope of redundancy elimination
Data compression is typically applied to a data stream, and redundancy elimination is limited to a sliding or caching window. For performance reasons this window is usually small, so compression can only exploit local redundancy, typically within an individual file. Deduplication works better on global storage systems containing many files, such as file systems: it first partitions all data into blocks and then eliminates duplicate blocks on a global scale. If data compression is applied globally, or deduplication is applied to a single file, the data reduction effect drops sharply.
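The window limit can be illustrated with a minimal Python sketch (the 4 KiB chunk size and 64 KiB block size are arbitrary choices for the demonstration): DEFLATE's 32 KiB sliding window cannot see that the second half of the stream duplicates the first, while block-level deduplication with a global view stores each chunk only once.

```python
import random
import zlib

block = random.Random(0).randbytes(64 * 1024)  # 64 KiB of incompressible data
stream = block + block                         # exact duplicate, 64 KiB apart

# Stream compression: DEFLATE's 32 KiB sliding window cannot reach back
# far enough to notice that the second half repeats the first, so the
# duplicate is stored again almost verbatim.
compressed = zlib.compress(stream)

# Block-level deduplication: split into 4 KiB chunks and keep unique ones;
# a global view finds that every chunk of the second copy is already stored.
chunks = [stream[i:i + 4096] for i in range(0, len(stream), 4096)]
unique = set(chunks)
deduped_size = sum(len(c) for c in unique)
```

Here `compressed` ends up larger than a single copy of the block, while `deduped_size` equals exactly one copy, because deduplication is not constrained by a local window.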
(2) Redundancy discovery method
Data compression finds identical data mainly through string matching, using a string-matching algorithm and its variants; this is exact matching. Deduplication finds identical data blocks through data fingerprints computed by a hash function; this is fuzzy matching. Exact matching is more complex, but it is more precise and eliminates redundancy more thoroughly. Fuzzy matching is much simpler and better suited to large-grained data blocks.
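A fingerprint-based store can be sketched in a few lines of Python; the `DedupStore` class and its method names are hypothetical, and SHA-1 stands in for whatever hash a real system uses:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    # The SHA-1 digest serves as the chunk's fingerprint; two chunks with
    # the same digest are treated as identical without byte-wise comparison.
    return hashlib.sha1(chunk).hexdigest()

class DedupStore:
    """Minimal fingerprint index: digest -> one stored copy of the chunk."""

    def __init__(self):
        self.index = {}

    def put(self, chunk: bytes) -> str:
        fp = fingerprint(chunk)
        self.index.setdefault(fp, chunk)  # keep only the first copy
        return fp

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.index.values())
```

Storing the same 4 KiB chunk twice and a different chunk once leaves only two entries in the index, which is exactly the "fuzzy matching" described above: equality is decided by comparing digests, not the data itself.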
(3) Redundancy granularity
The redundant granularity of data compression can be small, down to a few bytes, and it is adaptive: no granularity range needs to be specified in advance. Deduplication is different: data blocks are larger, typically 512 bytes to 8 KB, and the partitioning is not adaptive. For fixed-length blocks the length must be specified in advance, while for variable-length blocks the upper and lower size limits must be specified. Smaller block granularity yields greater data reduction, but also incurs greater computational cost.
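Variable-length chunking with pre-specified upper and lower limits can be sketched with a Gear-style rolling hash, as used in FastCDC-like chunkers; the gear table, mask, and size limits below are illustrative choices, not values from any particular product:

```python
import random

# Random 32-bit "gear" table; a fixed seed keeps the sketch deterministic.
_rng = random.Random(1)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def cdc_chunks(data: bytes, min_size=512, max_size=8192, mask=0x1FFF):
    """Content-defined chunking: declare a boundary where the rolling
    hash's low bits are all zero (average chunk ~ mask+1 bytes), but
    never below min_size or above max_size, as the text describes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF  # gear hash: old bytes age out
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries depend on content rather than absolute offsets, an insertion near the start of a file shifts only nearby chunks; fixed-length partitioning would shift every subsequent chunk and defeat deduplication.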
(4) Performance bottleneck
The key performance bottleneck of data compression is string matching: the larger the sliding or cache window, the more computation is required. The performance bottlenecks of deduplication are data partitioning and data fingerprinting. Hash functions such as MD5 and SHA-1 are computationally expensive and consume CPU resources. In addition, the data fingerprints must be stored and looked up, which often requires a large in-memory hash table; if memory is limited, performance suffers significantly.
(5) Application method
Data compression processes streaming data directly, without prior analysis or statistics over global information. It combines well with other applications through pipes and pipelines, and can be applied transparently, in-band, to storage or network systems. Deduplication involves chunking data, storing and retrieving fingerprints, and representing the original physical file logically. Applying it to an existing system usually requires modifying the application, which makes transparent deployment difficult. At present, deduplication is not a common standalone feature; it appears mostly inside products such as storage systems, file systems, or application systems. Data compression, by contrast, is a standard feature, so from an application perspective it is the simpler of the two.
Data compression and deduplication focus on different levels and can be combined to achieve a higher overall data reduction ratio. Notably, when both are applied, deduplication is usually performed first, and data compression is then used to further shrink the logical file representation and the unique data blocks; this reduces the processing load on the system and increases the overall compression ratio. What happens if the order is reversed?
Compression recodes the data and destroys its native block-level redundancy, so applying deduplication afterward is less effective and takes more time. Deduplicating first eliminates the redundant data blocks, after which compression is applied once to each unique block. In this way the reduction effects of the two techniques are superimposed, and the time spent on compression is greatly reduced. Deduplicating first and then compressing therefore yields both a higher data reduction ratio and better performance.
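The effect of ordering can be sketched in Python (chunk and block sizes are arbitrary; a repeated random block stands in for redundant backup data): deduplicating first leaves two unique chunks to compress, while compressing first recodes the stream so that deduplication finds no identical chunks at all.

```python
import random
import zlib

CHUNK = 4096
block_a = random.Random(0).randbytes(2 * CHUNK)  # one 8 KiB "file segment"
data = block_a * 10                              # heavy block-level redundancy

def chunked(buf: bytes):
    return [buf[i:i + CHUNK] for i in range(0, len(buf), CHUNK)]

# Recommended order: deduplicate first, then compress each unique chunk.
unique = list(dict.fromkeys(chunked(data)))      # order-preserving dedup
dedup_then_compress = sum(len(zlib.compress(c)) for c in unique)

# Reversed order: compressing first recodes the whole stream, so the
# repeated blocks no longer appear as identical chunks and a subsequent
# deduplication pass finds nothing to eliminate.
compressed_stream = zlib.compress(data)
unique_after = set(chunked(compressed_stream))   # all chunks distinct
```

In the reversed order, every chunk of `compressed_stream` is unique, so deduplication contributes no further reduction; in the recommended order, only the two unique chunks ever reach the compressor.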
Meanwhile, Vinchin Backup & Recovery provides data deduplication and compression technology that deduplicates and compresses data in the backup repository to reduce storage space and costs.