What is Hashing?
Whether you are writing code or designing a system, you will run into hashing. A solid grasp of the concept helps you design high-quality systems and do well in a systems design interview.
In simple words, hashing in computer science is a way to quickly convert a piece of information of any size, like a word, a number, or a whole file, into a small fixed-size code. It’s like using a special function or formula to create a fingerprint that identifies the information. The fingerprint is not guaranteed to be unique, because two different inputs can occasionally produce the same code (a collision), but a good hash function makes that rare. This helps in organizing and finding data faster, securing passwords, and other useful tasks in computer systems.
Where to Use Hashing?
We use hashing in various areas. Here are a few simple examples:
Data retrieval: In Java and other programming languages, key-value data structures rely on a hash algorithm. In Java, the HashMap and Hashtable classes store key-value pairs, and HashSet (which is backed by a HashMap) stores unique elements. Each of them calls the key’s hashCode method to pick a bucket, so a lookup only has to inspect that one bucket instead of scanning every entry.
Password storage: When you create an account on a website or an app, your password is often hashed and stored in a database. Hashing ensures that even if someone gains unauthorized access to the database, they can’t quickly retrieve the original passwords. When you log in, your entered password is hashed again, and the system checks if it matches the stored hash.
Digital signatures: Hashing plays a crucial role in verifying the authenticity and integrity of digital documents. A hash function creates a fingerprint of the document, and the signer then signs that fingerprint with a private key. If a single character in the document changes, the hash value will be completely different, so the signature no longer verifies. This is how a recipient can check that the document has not been tampered with.
File integrity checks: Hashing ensures the integrity of files during transmission or storage. By calculating and comparing hash values before and after transferring a file, you can verify if the file has been modified or corrupted.
Caching: We commonly use hashing in caching mechanisms. A cache is a temporary storage that holds frequently accessed data to improve performance. Hashing allows quick lookup and retrieval of cached data based on a unique key. By using hash functions, caches can efficiently store and retrieve data, reducing the need to access slower storage systems.
Database indexing: Databases often use hashing to create indexes for efficient retrieval. Hash-based indexes store keys and corresponding pointers to data records. This enables fast lookup operations for equality-based searches, where the hash of the key identifies the bucket that holds the pointer to the desired record. Hash indexes do not help with range queries, because hashing does not preserve the order of the keys.
Cryptographic applications: Hash functions are fundamental in cryptographic algorithms. They are used to create digital signatures, ensure message integrity, and provide secure storage. Hash functions in cryptography are designed to be one-way, meaning it is extremely difficult to reverse-engineer the original input from the hash value.
Data deduplication: Hashing is employed in data deduplication techniques to identify and eliminate duplicate data. By hashing the content of files or data blocks, identical instances can be easily identified and eliminated, leading to optimized storage utilization.
Load balancing: Hashing is utilized in load-balancing algorithms to distribute incoming requests across multiple servers. A hash function is applied to the request data, and the result is used to determine which server should handle the request. This helps evenly distribute the workload and improves system performance.
Digital forensics: Hashing plays a significant role in digital forensics for data integrity verification, file identification, and password hash analysis. Hash values are computed for original files and used to detect changes or tampering during the forensic process. Hash functions are also employed to analyze password hash databases recovered from compromised systems.
Memoization: Closely related to caching, memoization stores the results of expensive function calls in a hash-based cache keyed by their input parameters. When the function is called again with the same arguments, the hash of those arguments locates the stored result, so it is returned without being recomputed.
These are just a few examples of where hashing can be used. The versatility, efficiency, and security properties of hashing make it an essential tool in various domains, including data management, security, cryptography, and performance optimization.
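Here’s a minimal sketch of the password storage use case from the list above. It assumes PBKDF2 (available in the JDK as "PBKDF2WithHmacSHA256") as the hashing scheme; the class name PasswordHashing, the iteration count, and the key size are illustrative choices, not recommendations from this article.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

class PasswordHashing {
    // Illustrative values only; pick real ones from current security guidance.
    static final int ITERATIONS = 100_000;
    static final int KEY_BITS = 256;

    // Derives a deliberately slow, salted hash from the password.
    static byte[] hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available", e);
        }
    }

    // Hashes the login attempt with the stored salt and compares the two hashes.
    static boolean verify(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }

    public static void main(String[] args) {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);            // a fresh random salt per user
        byte[] stored = hash("s3cret".toCharArray(), salt);

        System.out.println(verify("s3cret".toCharArray(), salt, stored)); // true
        System.out.println(verify("guess".toCharArray(), salt, stored));  // false
    }
}
```

Only the salt and the derived hash would be stored; the password itself never needs to be written to the database.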
Practical Java Hash Example
Let’s see what a hash looks like in practice in Java, starting with the hashCode method that hash-based collections such as HashSet and HashMap call.
The following is the hashCode implementation of the String class. You don’t need to memorize or fully understand it, but it is good to have an idea of what a hashCode is. As mentioned before, a hashCode makes data retrieval faster: a collection uses the code to jump to the right bucket, and it only runs the more expensive equals comparison on the few entries stored there.
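    public final class String
            implements java.io.Serializable, Comparable<String>, CharSequence, Constable, ConstantDesc {

        // Excerpt from the JDK source; fields and other methods are omitted.
        public int hashCode() {
            int h = hash;                  // cached value, 0 if not computed yet
            if (h == 0 && !hashIsZero) {
                h = isLatin1() ? StringLatin1.hashCode(value)
                               : StringUTF16.hashCode(value);
                if (h == 0) {
                    hashIsZero = true;     // the real hash is 0, so don't recompute it
                } else {
                    hash = h;              // cache the computed hash
                }
            }
            return h;
        }
    }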
Let’s see how the hashCode is printed:
    public class HashExample {
        public static void main(String[] args) {
            String name = "Duke";
            System.out.println(name.hashCode());
        }
    }
Output:
2141643
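To see where that number comes from, here’s a small sketch that recomputes it by hand. It relies on the formula given in the String.hashCode Javadoc, s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]; the class name ManualStringHash and the method polynomialHash are made up for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

class ManualStringHash {
    // Recomputes the documented String.hashCode() formula with int arithmetic.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // int overflow simply wraps around
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("Duke"));   // 2141643
        System.out.println("Duke".hashCode());        // 2141643

        // A HashSet uses this hash to find the bucket, then equals() to confirm.
        Set<String> names = new HashSet<>();
        names.add("Duke");
        System.out.println(names.contains("Duke"));   // true
    }
}
```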
Server Selection
Imagine you have a big website that gets lots of visitors, and you want to distribute the workload across multiple servers to ensure smooth performance. Server selection hashing is a technique used to decide which server should handle each request.
Here’s how it works. Let’s say you have a bunch of servers available, labeled Server A, Server B, Server C, and so on.
When a user sends a request to your website, you want to determine which server should handle that request.
To make this decision, you use a hash function. A hash function takes some input, like the user’s IP address or a unique identifier, and produces a fixed-length hash value.
The hash value is then used to map the request to one of the servers. For example, if the hash value falls within a certain range, you might assign the request to Server A. If it falls within another range, you might assign it to Server B, and so on.
The idea behind server selection hashing is that the same input will always produce the same hash value. So, if a particular user sends multiple requests, they will consistently be directed to the same server, ensuring that their session remains intact.
By using server selection hashing, you can evenly distribute the incoming requests across multiple servers. This helps balance the workload, prevent overloading of individual servers, and ultimately provide a better user experience on your website.
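The idea can be sketched in a few lines of Java. This version assumes the simplest mapping, the hash value modulo the number of servers, instead of explicit ranges; the class name ServerSelection, the server labels, and the sample IP address are illustrative.

```java
import java.util.List;

class ServerSelection {
    // Maps a request key (for example a client IP) to one server by hashing it
    // and taking the result modulo the number of servers.
    static String pick(String requestKey, List<String> servers) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(requestKey.hashCode(), servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        List<String> servers = List.of("Server A", "Server B", "Server C");
        System.out.println(pick("203.0.113.7", servers));
        System.out.println(pick("203.0.113.7", servers)); // same key, same server
    }
}
```

Note that the result depends on servers.size(), so adding or removing a server changes the mapping for most keys. That weakness is what consistent hashing and rendezvous hashing address.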
Types of Hashing
Hashing is a fundamental concept in computer science and cryptography, used to transform data into a fixed-size value (hash value or hash code) that represents the original data. There are several types of hash functions and hashing algorithms available, each with its own characteristics and areas of application. Here are some commonly used types of hashing:
Cryptographic Hash Functions: These hash functions are primarily designed for data integrity and security. They produce a fixed-size hash value (typically 128, 160, 256, or 512 bits) and aim for properties like pre-image resistance, second pre-image resistance, and collision resistance. Examples include MD5 (Message Digest Algorithm 5), SHA-1 (Secure Hash Algorithm 1), SHA-256, and SHA-3. Note that practical collision attacks exist against MD5 and SHA-1, so they should not be used where collision resistance matters.
Non-Cryptographic Hash Functions: These hash functions are generally faster but do not provide the same level of security as cryptographic hash functions. They are commonly used for hash tables, data indexing, checksums, and other non-security-critical applications. Examples include MurmurHash, Jenkins Hash, and Pearson Hash.
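Message Digest Algorithms: A message digest algorithm takes an input message of arbitrary length and produces a fixed-size hash value, so this category overlaps heavily with cryptographic hash functions. Message digest algorithms are commonly used in digital signatures and data integrity checks. For password storage, a plain digest is too fast to resist brute-force guessing, and a key derivation function (see Cryptographic Key Derivation Functions in this list) is the better choice. Examples include MD5, SHA-1, and SHA-2.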
Perfect Hash Functions: A perfect hash function maps every key in a fixed, known set to a distinct hash value, so no two keys from that set collide. Perfect hash functions are useful when the keys are known in advance and rarely change, such as a table of reserved words, because lookups then never need collision handling.
Bloom Filters: A Bloom filter is not a hash function itself but a probabilistic data structure that uses several hash functions to test whether an element is a member of a set. Bloom filters provide fast, memory-efficient membership tests. They can return false positives (reporting an element that was never added) but never false negatives. They are commonly used in caching, spell-checking, and network routing.
Cryptographic Key Derivation Functions: These functions are used to derive cryptographic keys from a given input, such as a password or passphrase. Key derivation functions are designed to be computationally expensive and slow down potential attackers in brute-forcing passwords. Examples include PBKDF2 (Password-Based Key Derivation Function 2) and bcrypt.
These are just a few examples of the types of hashing algorithms and functions available. The choice of a particular hashing algorithm depends on the specific requirements of the application, such as security, speed, and memory constraints.
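As an illustration of one item from the list, here’s a minimal Bloom filter sketch. The class name SimpleBloomFilter is made up, and deriving the bit positions from String.hashCode with a simple mixing step is an assumption chosen for brevity; real implementations typically use stronger hash functions such as MurmurHash.

```java
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, for illustration
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("apple");
        System.out.println(filter.mightContain("apple"));  // true
        System.out.println(filter.mightContain("cherry")); // usually false, but a false positive is possible
    }
}
```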
Consistent Hashing
Imagine you have a big website that uses multiple servers to handle user requests. Consistent hashing is a technique used to distribute the load evenly across these servers while minimizing disruptions when servers are added or removed.
Here’s how it works:
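Instead of assigning requests to servers with fixed ranges or a modulo, as in regular hashing, consistent hashing uses a circular, ring-like structure. Each server is represented by one or more points on the ring, and the position of each point comes from hashing the server’s identifier. Giving each server many points (often called virtual nodes) spreads them more evenly around the ring.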
When a request comes in, a hash function is used to calculate a hash value for that request.
The hash value is then mapped onto the ring. Starting from that point, you move clockwise on the ring until you reach the first server point, wrapping around to the start of the ring if necessary.
This server is responsible for handling the request. So, all requests whose hash values fall on the same arc of the ring, between one server point and the next, are directed to the same server.
The key idea behind consistent hashing is that when a server is added or removed, only a small portion of the keys or requests need to be remapped: with N servers, roughly 1/N of them on average. This minimizes the disruption to the overall system. With regular modulo-based hashing, changing the number of servers changes the result for almost every key, so most keys would have to be remapped, which can be time-consuming and inefficient.
By using consistent hashing, you can achieve load balancing across servers while maintaining stability when servers are added or taken out of the system.
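Here’s a minimal sketch of such a ring, assuming a TreeMap as the sorted ring structure and the first 8 bytes of a SHA-256 digest as the ring position. The class name ConsistentHashRing, the "#i" suffix for virtual nodes, and the count of 100 virtual nodes are illustrative choices.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Position on the ring: the first 8 bytes of a SHA-256 digest.
    static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Moves clockwise: the first point at or after the key's hash, wrapping around.
    String serverFor(String requestKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(requestKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        for (String server : List.of("Server A", "Server B", "Server C")) {
            ring.addServer(server);
        }
        String[] before = new String[1000];
        for (int i = 0; i < before.length; i++) {
            before[i] = ring.serverFor("request-" + i);
        }
        ring.addServer("Server D");
        int moved = 0;
        for (int i = 0; i < before.length; i++) {
            if (!ring.serverFor("request-" + i).equals(before[i])) {
                moved++;
            }
        }
        System.out.println("Keys moved after adding a server: " + moved + " of " + before.length);
    }
}
```

Every key that moves in this run moves to the new server; keys that belonged to the other servers' remaining arcs stay where they were.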
Rendezvous Hashing
Rendezvous hashing, also known as highest random weight (HRW) hashing, is a technique used to determine which node or server should handle a particular request based on a predetermined set of nodes. It aims to evenly distribute the workload while maintaining stability when nodes are added or removed from the system.
Here’s how it works. Imagine you have a set of nodes or servers labeled A, B, C, and so on.
When a request comes in, a hash function is applied to the combination of the request key and each node’s identifier, one node at a time.
For each node, this produces a hash value that acts as a score or weight. The node with the highest value is chosen as the owner of that request.
The idea behind rendezvous hashing is that each request “rendezvous” with the node that generates the highest hash value for that request. It’s like the request and the node “meet” at the point with the highest value.
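If a node is removed from the system, only the requests that were assigned to that node need to be remapped, and each one moves to the node with its next-highest value. If a node is added, only the requests for which the new node now produces the highest value move to it. This minimizes the disruption caused by node changes.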
The strength of rendezvous hashing lies in its ability to evenly distribute the workload across nodes. Since the hash function takes both the request and node into account, it provides a balanced assignment of requests, ensuring that no single node becomes overwhelmed with requests while others remain idle.
Rendezvous hashing is particularly useful in scenarios where the set of nodes is relatively small and stable, and the focus is on load balancing and minimizing remapping when nodes change. The trade-off is that, in its basic form, every lookup computes one hash per node.
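The selection rule can be sketched as follows. The score function here is an assumption: it hashes the request key and the node name joined by "|" with SHA-256 and uses the first 8 bytes as the weight. The class name RendezvousHashing and the node labels are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

class RendezvousHashing {
    // Weight of a (request, node) pair: the first 8 bytes of a SHA-256 digest.
    static long score(String requestKey, String node) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((requestKey + "|" + node).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Picks the node with the highest weight for this request.
    static String pick(String requestKey, List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes available");
        }
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : nodes) {
            long s = score(requestKey, node);
            if (best == null || s > bestScore) {
                best = node;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B", "C");
        System.out.println(pick("request-42", nodes));
        System.out.println(pick("request-42", nodes)); // same request, same node
    }
}
```

Because each node's weight for a request is independent of the other nodes, removing one node leaves the winner unchanged for every request that node did not own.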
SHA Detailed Explanation
SHA (Secure Hash Algorithm) is a family of cryptographic hash functions. Each one takes an input (usually a message or data) and produces a fixed-size output called a hash value or digest. The purpose of the hash function is to ensure data integrity and provide a compact representation of the input that is, for practical purposes, unique.
There are several versions of SHA, such as SHA-1, SHA-256, SHA-384, and SHA-512, each with a different output size. SHA-256, SHA-384, and SHA-512 belong to the SHA-2 family. One of the most commonly used versions is SHA-256, which produces a 256-bit hash value.
Here’s a high-level explanation of how SHA hashing works:
Pre-processing: The input data is processed to meet specific requirements. Padding is added to the input to make it a multiple of a fixed block size. Additional information, such as the length of the input, may also be appended.
Message Digest Initialization: The hash algorithm initializes a set of variables, called the initial hash values, which serve as the starting point for the hashing process. These values are predefined for each SHA variant.
Message Digest Computation: The input data is divided into multiple blocks, and the hash algorithm processes each block in sequence. For each block, logical and arithmetic operations are performed to transform the data and update the hash values.
Output: Once all blocks have been processed, the final hash value, the message digest, is obtained. The message digest is a fixed-length representation of the input data. Even a tiny change in the input data will likely produce a substantially different message digest.
SHA hashing is designed to have several properties:
Deterministic: Given the same input, the hash function will always produce the same output. This allows for easy verification of data integrity.
Fast computation: SHA hashing algorithms are designed to be efficiently computed on modern computer systems.
Pre-image resistance: Given a hash value, it should be computationally infeasible to determine the original input that produced that hash value.
Collision resistance: It should be extremely difficult to find two inputs producing the same hash value.
SHA hashes are widely used in various applications, such as digital signatures and data integrity verification. They provide a reliable way to verify the integrity of data without revealing the original input. For password storage, a plain SHA hash is too fast to resist brute-force guessing, so it is normally used with a salt inside a key derivation function such as PBKDF2. It’s also worth noting that SHA-1 has known collision attacks and is no longer considered secure for applications that depend on collision resistance, such as digital signatures. Using SHA-256 or SHA-3 is generally recommended when possible.
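Here’s a short sketch of computing a SHA-256 digest with the JDK’s MessageDigest class. The class name Sha256Example and the helper sha256Hex are made up for this example; the digest shown for "abc" is the standard published test vector.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha256Example {
    // Returns the SHA-256 digest of the input as 64 lowercase hex characters.
    static String sha256Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        System.out.println(sha256Hex("abc"));
        // Changing one character gives a completely different digest.
        System.out.println(sha256Hex("abd"));
    }
}
```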
Difference Between Hashing and Cryptography
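Hashing and cryptography are related concepts but serve different purposes in computer science and information security. Cryptographic hash functions are themselves one of the building blocks of cryptography, so the comparison below is mostly between hashing and encryption. Here are the key differences: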
Purpose
Hashing: Hashing is primarily used for data integrity and verification. It takes an input (message or data) of any size and produces a fixed-size output called a hash value or digest. The main purpose of hashing is to ensure that the data has not been tampered with or modified during transmission or storage.
Cryptography: Cryptography involves the encryption and decryption of data to ensure confidentiality, integrity, authenticity, and non-repudiation. It encompasses various techniques, including encryption algorithms, key management, digital signatures, and more. Cryptography aims to protect data from unauthorized access and ensure secure communication.
Output
Hashing: The hashing output is a fixed-size hash value or digest, typically represented as a sequence of characters or numbers. The hash value acts as a compact fingerprint of the input data. It is not strictly unique, since different inputs can collide, but even a small change in the input will produce a significantly different hash value.
Cryptography: The output of cryptography depends on the specific cryptographic algorithm being used. It can involve encryption, where the original data is transformed into an unreadable form (ciphertext), and decryption, where the ciphertext is converted back to its original form (plaintext). Cryptography does not necessarily produce fixed-size outputs like hashing.
Reversibility
Hashing: Hash functions are designed to be one-way functions, meaning retrieving the original input data from the hash value is computationally infeasible. Hash functions are irreversible, and the original data cannot be obtained from the hash value.
Cryptography: Cryptographic algorithms can be reversible, allowing the original data to be retrieved from the encrypted form using the appropriate decryption key. Encryption and decryption are complementary operations in cryptography.
Key Usage
Hashing: Hash functions do not typically use keys. The same input data will always produce the same hash value. Plain hashing is deterministic and does not involve any secret information; keyed constructions such as HMAC are the exception.
Cryptography: Cryptographic algorithms often involve using keys. Encryption requires a key to transform the original data into ciphertext, and decryption requires a corresponding key to revert the ciphertext to plaintext. The security of cryptographic systems relies on keeping the keys secret.
Security Goals
Hashing: The main security goal of hashing is data integrity. Hash functions are designed to detect even minor changes in the input data by producing a different hash value. They are not designed to provide confidentiality or protect against unauthorized access.
Cryptography: Cryptography encompasses multiple security goals, including confidentiality (ensuring the data remains secret), integrity (detecting data tampering), authenticity (verifying the identity of the sender), and non-repudiation (preventing the sender from denying their actions). Cryptographic techniques are designed to provide a comprehensive set of security services.
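The reversibility and key usage points can be illustrated with a small encryption round trip. This sketch assumes AES in GCM mode from the JDK’s javax.crypto package; the class name EncryptionRoundTrip and its helper methods are illustrative, and key storage is left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class EncryptionRoundTrip {
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // A fresh random 12-byte IV; it must never be reused with the same key.
    static byte[] newIv() {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) {
        return run(Cipher.ENCRYPT_MODE, key, iv, plaintext);
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) {
        return run(Cipher.DECRYPT_MODE, key, iv, ciphertext);
    }

    private static byte[] run(int mode, SecretKey key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(mode, key, new GCMParameterSpec(128, iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] iv = newIv();
        byte[] ciphertext = encrypt(key, iv, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] recovered = decrypt(key, iv, ciphertext);

        // Unlike a hash, the output grows with the input and can be reversed with the key.
        System.out.println(ciphertext.length);                              // 21 (5 bytes + 16-byte tag)
        System.out.println(new String(recovered, StandardCharsets.UTF_8));  // hello
    }
}
```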
Difference Between Hashing and Encoding
Hashing and encoding are techniques used to transform data, but they serve different purposes and have distinct characteristics. Here’s a comparison between hashing and encoding:
Purpose
Hashing: The primary purpose of hashing is data integrity and security. Hash functions are designed to generate a fixed-size hash value (digest) that represents the original data. Hashing verifies data integrity, detects changes or tampering, and securely stores passwords.
Encoding: Encoding is primarily used for data representation and transformation. It is used to convert data from one format to another, such as converting text to binary or encoding special characters for safe transmission.
Reversibility
Hashing: Hashing is a one-way process. Once data is hashed, it is computationally difficult (ideally, practically impossible) to reverse the process and obtain the original data from the hash value. Hash functions are designed to be irreversible, ensuring that the original data cannot be easily derived from the hash.
Encoding: Encoding is typically a reversible process. The encoded data can be decoded back to its original form using the appropriate decoding algorithm or scheme. The encoding and decoding operations are designed to be symmetrical.
Data Loss
Hashing: Hashing is a lossy process, meaning that the original data is not recoverable from the hash value. Hash functions condense the input data into a fixed-size output, discarding any excess information. Due to this property, hash functions can generate the same hash value for different input data (collisions).
Encoding: Encoding is generally a lossless process, ensuring that the original data can be fully recovered from the encoded representation. The encoding scheme maintains all the information of the original data during the transformation.
Security
Hashing: Hashing is commonly used for security purposes. Cryptographic hash functions, specifically designed for security, provide properties like pre-image resistance, second pre-image resistance, and collision resistance. Hashing is used in password storage, digital signatures, and data integrity checks.
Encoding: Encoding is not primarily designed for security. While some encoding schemes, like Base64, can provide a level of obfuscation, they are not intended to protect data from deliberate attacks or ensure data integrity.
Examples
Hashing: Examples of hashing algorithms include MD5, SHA-1, SHA-256, and bcrypt. These algorithms are commonly used for data integrity checks, password storage, and cryptographic protocols.
Encoding: Examples of encoding schemes include Base64, UTF-8 encoding, URL encoding, and HTML encoding. These schemes are used to transform data between different representations or ensure safe data transmission.
In summary, hashing is a one-way, irreversible process primarily used for data integrity and security, while encoding is a reversible data representation and transformation process. Hashing focuses on generating fixed-size hash values for security purposes, while encoding preserves the original data and allows for recovery.
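A quick sketch makes the contrast concrete. It uses java.util.Base64 for the encoding side and String.hashCode for the hashing side; the class name EncodingVsHashing and its helper methods are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodingVsHashing {
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("Duke");
        System.out.println(encoded);          // RHVrZQ==
        System.out.println(decode(encoded));  // Duke (encoding is reversible)

        // A hash keeps only a fixed-size fingerprint; there is no decode step.
        System.out.println("Duke".hashCode()); // 2141643
    }
}
```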
Conclusion
In conclusion, hashing is a useful tool in computer science that helps ensure the integrity of data, securely store passwords, and quickly retrieve information. By converting data into fixed-size hash codes, hashing allows us to verify if data has been tampered with, protect passwords from being easily stolen, and efficiently locate information in large datasets. Although there are some limitations and potential vulnerabilities, hashing remains an essential and widely used technique in various applications.