Principles of Hashing in Database Management Systems

Hashing was initially a confusing topic for me during loksewa preparation. The concept seemed simple, but understanding collision handling and different hashing techniques took some time. See, let me share what i've learned about hashing principles in dbms. Honestly, this took me forever to get.

Introduction to Hashing

Now this is where I got confused. Hashing is a technique used to map data of arbitrary size to fixed-size values using a hash function. In databases, hashing is primarily used for fast data retrieval, indexing, and storage organization. I was worried about this topic.

This showed up in my loksewa exam too.

Here's how I like to think about it: Hashing is like having a filing system where instead of searching through every file alphabetically, you use a formula to calculate exactly which drawer contains the file you need. Don't overthink this one.

Basic Hashing Concepts

Hash Function

A mathematical function that converts a key into a hash value (address).

I finally understood this during revision week. Properties of a good hash function: Deterministic: Same input always produces same output. Uniform distribution: Spreads keys evenly across hash table.

Fast computation: Quick to calculate
Avalanche effect: Small input change causes large output change

Hash Table

An array of fixed size where data is stored based on hash values.

This is actually simpler than it sounds. ### Key Components Key: The data item used for hashing (e.g., student ID, employee number).

Hash Value: Result of applying hash function to the key Bucket: Storage location in hash table.
Load Factor: Ratio of filled slots to total slots Don't overthink this one.

Types of Hash Functions

1. Division Method

Formula: h(k) = k mod m

I made flashcards for this - that's how I remembered it. Where: k = key value.

m = size of hash table (preferably prime number)

Example:

Key = 1234, Hash table size = 10
h(1234) = 1234 mod 10 = 4
Store at index 4

Advantages:

Simple to implement Fast computation.

Disadvantages: Poor distribution if m isn't chosen carefully.

Clustering issues

2. Multiplication Method

Formula: h(k) = floor(m × (k × A mod 1))

My study group helped me figure this out. Where:

A is a sort of constant (0 < A < 1)
Knuth suggests A = (√5 - 1)/2 ≈ 0.618

Advantages: Value of m isn't critical. Better distribution than division method.

Disadvantages:

More complex computation
Requires floating-point arithmetic

3. Mid-Square Method

Process:

Square the key
Extract middle digits
Use as hash value

Example:

Key = 123
Square = 123² = 15129
Middle digits = 512 (if we want 3 digits)
Hash value = 512 mod table_size

4. Folding Method

Process:

Divide key into equal parts
Add the parts together
Use sum as hash value

Example:

Key = 123456789
Divide: 123 | 456 | 789
Sum = 123 + 456 + 789 = 1368
Hash value = 1368 mod table_size

Collision Resolution Techniques

Collision: When two different keys produce the same hash value.

1. Open Addressing (Closed Hashing)

Linear Probing

Method: If collision occurs, check next slot sequentially.

Formula: h'(k) = (h(k) + i) mod m, where i = 0, 1, 2, ...

Example:

Hash table size = 7
Insert keys: 10, 22, 31, 4, 15

h(10) = 10 mod 7 = 3 → Store at index 3
h(22) = 22 mod 7 = 1 → Store at index 1  
h(31) = 31 mod 7 = 3 well, → Collision! Try index 4 → Store at index 4
h(4) = 4 mod 7 = 4 → Collision! Try index 5 → Store at index 5
h(15) = 15 mod 7 = 1 → Collision! Try index 2 → Store at index 2

Advantages: Simple implementation.

Good cache performance

Disadvantages: Primary clustering.

Performance degrades as table fills

Quadratic Probing

Method: Use quadratic function to find next slot.

Formula: h'(k) = (h(k) + i²) mod m, where i = 0, 1, 2, ...

Advantages: Reduces primary clustering.

Better distribution than linear probing

Disadvantages:

Secondary clustering May not find empty slot even if available.

Double Hashing

Method: Use second hash function to determine step size.

Formula: h'(k) = (h₁(k) + i × h₂(k)) mod m

Example:

h₁(k) = k mod 7
h₂(k) = 5 - (k mod 5)

Advantages: Eliminates clustering.

Good distribution

Disadvantages:

More complex implementation
Requires two hash functions

2. Closed Addressing (Open Hashing)

Separate Chaining

Method: Each hash table slot contains a linked list of all keys that hash to that slot.

Example:

Hash table with chaining:
Index 0: NULL
Index 1: [22] → [15] → NULL
Index 2: NULL  
Index 3: [10] → [31] → NULL
Index 4: [4] → NULL
``` This is easier than it looks.

**Advantages**:
Simple implementation.
No clustering.
- Table can exceed load factor of 1

**Disadvantages**:
- Extra memory for pointers
Poor cache performance.
- Requires dynamic memory allocation

## **Hashing in Database Applications**

### **1. Hash Indexes**
Fast equality searches (O(1) average case).
- Not suitable for range queries
Used in hash-based join algorithms.

### **2. Hash Partitioning**
- Distribute data across multiple storage basically, devices
- Based on hash value of partition key
Enables parallel processing.

### **3. Hash Joins**
Build hash table for smaller relation.
- Probe with larger relation
- Efficient for equi-joins

### **4. Hash-based Aggregation**
- Group records by hash value of grouping attributes
Efficient for GROUP BY operations.

## **Performance Analysis**

### **Load Factor**
**α = n/m**
Where:
n = number of keys.
- m = hash table size

**Impact on performance**:
- α < 0.7: Good performance
α > 0.8: Performance degrades significantly.

### **Expected Number of Probes**

**Linear Probing**:
- Successful search: ½(1 + 1/(1-α))
Unsuccessful search: ½(1 + 1/(1-α)²).

**Separate Chaining**:
- Successful search: 1 + α/2
Unsuccessful search: α.

## you know, **Advantages and Disadvantages of Hashing**

### **Advantages**
- **Fast access**: O(1) average case for search, insert, delete
- **Simple implementation**: Basic concept is straightforward
**Space efficient**: No wasted space (with good load factor).
**Suitable for equality searches**: Perfect for exact match queries.

### **Disadvantages**
- **No ordering**: Cannot perform range queries efficiently
- **Collision handling**: Requires additional complexity

I bombed this topic in my first practice test.
- **Hash function dependency**: Performance depends on hash function quality
**Worst case O(n)**: When all keys hash to same value.

## **How I Remembered This Stuff**

**Key concept**: Hash function maps keys to addresses for fast access.
- **Collision resolution**: Open addressing (probing) vs Closed addressing (chaining)
- **Load factor**: Keep below 0.7-0.8 for good performance
**Applications**: Indexes, joins, partitioning, aggregation.
- **Remember**: Good for equality, bad for range queries I found a trick that really works.

### **Questions from My Experience**

During my exam prep, I noticed these questions keep showing up:

I got this wrong so many times in practice.

1. **"What is the main advantage of hashing over other indexing methods?"**
   - Answer: O(1) average time complexity for search, insert, and delete operations
   - *Tip: Emphasize the constant time access*

2. **"What happens when two keys produce the same hash value?"**
   - Answer: Collision occurs, which must be resolved using techniques like chaining or open addressing
   - *Tip: Know at least two collision resolution methods* I almost got this wrong in my exam.

3. **"Calculate hash value for key 1234 using division method with table size 11"**
   - Answer: h(1234) = 1234 mod 11 = 2
   - *Tip: Practice these calculations* I found a trick that really works.

4. **"What is the difference between linear probing and separate chaining?"**
   - Answer: Linear probing finds next empty slot; chaining uses linked lists at each slot
   - *Tip: Know the trade-offs of each method* I felt proud when I solved this.

5. **"Why is hashing not suitable for range queries?"**
   - Answer: Hash function destroys ordering, so consecutive keys may be stored far apart
   - *Tip: This is a key limitation to remember*

**Pro tip from my experience**: When solving hashing problems, always consider the load factor and collision resolution method. These two factors significantly impact performance. Also, remember that hashing is excellent for equality searches but terrible for range queries - this distinction often appears in exams.