mirror of https://github.com/kristoferssolo/Databases-II-Cheatsheet.git
synced 2025-10-21 18:20:35 +00:00
commit c0b7597c98 ("v2"), parent d7d61184ea
img/bitmap.png: new binary file (42 KiB, not shown)
main.typ: 303 lines
#set page(margin: (
  top: 0.6cm,
  bottom: 0.6cm,
  right: 0.6cm,
  left: 0.6cm,
))

#set text(6.2pt)
#show heading: it => {
  if it.level == 1 {
    // pagebreak(weak: true)
    text(8.5pt, upper(it))
  } else if it.level == 2 {
    text(8pt, smallcaps(it))
  } else {
    text(8pt, smallcaps(it))
  }
}
== Bitmap

Each bit in a bitmap corresponds to a possible item or condition, with a bit
set to 1 indicating presence (true) and a bit set to 0 indicating absence
(false).

#figure(
  image("img/bitmap.png", width: 30%)
)
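A minimal sketch of a bitmap index in Python, on hypothetical column data
(one bitmap per distinct value; equality is a lookup, AND/OR are bitwise
operations):

```python
# Hypothetical indexed column: one row value per position.
rows = ["red", "blue", "red", "green", "blue"]

# Build: bitmaps[v][i] == 1 iff row i has value v.
bitmaps = {}
for i, v in enumerate(rows):
    bitmaps.setdefault(v, [0] * len(rows))[i] = 1

# Equality query "color = red" is just a bitmap lookup:
print(bitmaps["red"])  # [1, 0, 1, 0, 0]

# Combining conditions is a bitwise operation on the bitmaps:
red_or_blue = [a | b for a, b in zip(bitmaps["red"], bitmaps["blue"])]
print(red_or_blue)  # [1, 1, 1, 0, 1]
```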
== B+ tree

*B+ tree* is a type of self-balancing tree data structure that maintains data
sorted and allows searches, sequential access, insertions, and deletions in
logarithmic time. It is an extension of the B-tree and is extensively used in
databases and filesystems for indexing. A B+ tree is *balanced*; its order
$n$ is defined such that each node (except the root) can have at most $n$
children (pointers) and at least $⌈n/2⌉$ children. *Internal nodes hold*
between $⌈n/2⌉−1$ and $n−1$ keys; leaf nodes also hold between $⌈n/2⌉−1$ and
$n−1$ keys but additionally store the data values corresponding to the keys.
*Leaf nodes linked*: leaf nodes are linked together, making range queries and
sequential access very efficient.

- *Insert (key, data)*:
  - Insert the key into the appropriate leaf node in sorted order;
  - If the node overflows (more than $n−1$ keys), split it, add the middle
    key to the parent, and adjust pointers;
    + Leaf split: keys $1$ to $ceil(frac(n,2))$ and $ceil(frac(n,2)) + 1$ to
      $n$ become two leaves. Promote (copy) the lowest key of the second
      leaf.
    + Internal node split: keys $1$ to $ceil(frac(n+1,2)) - 1$ stay, key
      $ceil(frac(n+1,2))$ is moved up, and the remaining keys form the new
      node.
  - If a split propagates to the root and causes the root to overflow, split
    the root and create a new root. Note: the root may contain fewer than
    $ceil(frac(n,2)) - 1$ keys.
- *Delete (key)*:
  - Remove the key from the leaf node.
  - If the node underflows (fewer than $⌈n/2⌉−1$ keys), keys and pointers
    are redistributed or nodes are merged to maintain minimum occupancy.
  - Adjustments may propagate up to ensure all properties are maintained.
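The linked leaves are what make range scans cheap: find the first qualifying
leaf, then walk right. A minimal sketch (not a full B+ tree; hypothetical
leaf contents):

```python
# Hypothetical leaves, already sorted and "linked" left-to-right.
leaves = [[2, 5], [8, 12], [15, 20]]

def range_query(lo, hi):
    """Collect keys in [lo, hi]: skip leaves entirely below lo, stop once
    a leaf reaches hi (stand-in for root-to-leaf descent + leaf links)."""
    out = []
    for leaf in leaves:
        if leaf[-1] < lo:      # whole leaf below the range: skip
            continue
        for k in leaf:
            if lo <= k <= hi:
                out.append(k)
        if leaf[-1] >= hi:     # range exhausted: stop following links
            break
    return out

print(range_query(5, 15))  # [5, 8, 12, 15]
```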
== Hash-index

*Hash indices* are a type of database index that uses a hash function to
compute the location (hash value) of data items for quick retrieval. They
are particularly efficient for equality searches that match exact values.

*Hash function*: a hash function takes a key (a data item's attribute used
for indexing) and converts it into a hash value. This hash value determines
the position in the hash table where the corresponding record's pointer is
stored.

*Hash table*: the hash table stores pointers to the actual data records in
the database. Each entry in the hash table corresponds to a potential hash
value generated by the hash function.
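A minimal sketch of the idea, assuming a toy hash function, a fixed bucket
count, and hypothetical records indexed on their first attribute:

```python
# Hypothetical records; row positions stand in for record pointers.
records = [("alice", 30), ("bob", 25), ("carol", 41)]
n_buckets = 8

def h(key):
    # Toy hash function; a real DBMS uses a better-distributed one.
    return hash(key) % n_buckets

# Build: each bucket holds "pointers" (positions) to matching records.
buckets = [[] for _ in range(n_buckets)]
for pos, (name, age) in enumerate(records):
    buckets[h(name)].append(pos)

def lookup(name):
    # Equality search: hash the key, scan only one bucket, and re-check
    # the key because different keys may collide in the same bucket.
    return [records[pos] for pos in buckets[h(name)] if records[pos][0] == name]

print(lookup("bob"))  # [('bob', 25)]
```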
= Algorithms

== Nested-loop join

=== Overview
*Nested Loop Join*: a database join operation where each tuple of the outer
table is compared against every tuple of the inner table to find all pairs
of tuples that satisfy the join condition. This method is simple but can be
inefficient for large datasets due to its high computational cost.

=== Cost
```python
# Simplified version (to get the idea)
for tr in r:
    for ts in s:
        test_pair(tr, ts)
```

// TODO: Add seek information
Block transfer cost (worst case): $n_r ∗ b_s + b_r$ block transfers, where
$n_r$ is the number of tuples in relation $r$ and $b_r$ is the number of
blocks in $r$ (similarly for $s$).
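The loop above, run on tiny in-memory relations (hypothetical `r(a, b)` and
`s(b, c)`, joined on `b`):

```python
# Hypothetical relations: r(a, b) and s(b, c).
r = [(1, "x"), (2, "y"), (3, "x")]
s = [("x", 10), ("y", 20)]

result = []
for tr in r:                 # outer relation
    for ts in s:             # inner relation: scanned once per outer tuple
        if tr[1] == ts[0]:   # join condition r.b = s.b
            result.append((tr[0], tr[1], ts[1]))

print(result)  # [(1, 'x', 10), (2, 'y', 20), (3, 'x', 10)]
```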
== Block-nested join

=== Overview
*Block Nested Loop Join*: an optimized version of the nested loop join that
reads and holds a block of rows from the outer table in memory and then
loops through the inner table, reducing the number of disk accesses and
improving performance over a standard nested loop join, especially when
indices are not available.

=== Cost

```python
# Simplified version (to get the idea)
for Br in blocks(r):
    for Bs in blocks(s):
        for tr in Br:
            for ts in Bs:
                test_pair(tr, ts)
```

// TODO: Add seek information
Block transfer cost (worst case): $b_r ∗ b_s + b_r$, where $b_r$ is the
number of blocks in relation $r$ (similarly for $s$).
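Plugging hypothetical sizes into the two cost formulas shows why holding a
whole block of the outer relation matters (worst case, one buffer block per
relation):

```python
# Hypothetical relation sizes.
n_r, b_r = 10_000, 400   # tuples / blocks of r (outer)
b_s = 100                # blocks of s (inner)

nested_loop = n_r * b_s + b_r    # inner relation scanned once per TUPLE of r
block_nested = b_r * b_s + b_r   # inner relation scanned once per BLOCK of r

print(nested_loop)   # 1000400
print(block_nested)  # 40400
```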
== Merge join

=== Overview
*Merge Join*: a database join operation where both the outer and inner
tables are first sorted on the join key and then merged by sequentially
scanning through both tables to find matching pairs. This method is highly
efficient when the tables are *already sorted* or can be *sorted quickly*,
and it minimizes random disk access.

+ Sort both tables: if not already sorted, the outer table and the inner
  table are sorted on the join keys.
+ Merge: once both tables are sorted, perform a merging operation similar to
  that used in merge sort:
  + Begin with the first record of each table.
  + Compare the join keys of the current records from both tables.
  + If the keys match, join the records and move to the next record in both
    tables.
  + If the join key of the outer table is smaller, move to the next record
    in the outer table.
  + If the join key of the inner table is smaller, move to the next record
    in the inner table.
  + Continue this process until all records in either table have been
    examined.
+ Output the joined rows.

=== Cost
The merge join is efficient: the number of block transfers is the sum of the
number of blocks in both files, $b_r + b_s$. Assuming $b_b$ buffer blocks
are allocated to each relation, $⌈b_r∕b_b⌉ + ⌈b_s∕b_b⌉$ disk seeks are
required.
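The merge step, sketched on already-sorted in-memory relations (hypothetical
data; for simplicity, join keys are assumed unique within each relation, so
both cursors advance on a match):

```python
# Hypothetical relations, already sorted on the join key (first attribute).
r = [(1, "a"), (3, "b"), (5, "c")]
s = [(3, "x"), (4, "y"), (5, "z")]

i = j = 0
result = []
while i < len(r) and j < len(s):
    if r[i][0] == s[j][0]:       # keys match: output, advance both
        result.append((r[i][0], r[i][1], s[j][1]))
        i += 1
        j += 1
    elif r[i][0] < s[j][0]:      # outer key smaller: advance outer
        i += 1
    else:                        # inner key smaller: advance inner
        j += 1

print(result)  # [(3, 'b', 'x'), (5, 'c', 'z')]
```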
== Hash-join

=== Overview
*Hash Join*: a database join operation that builds an in-memory hash table
on the join key of the smaller table (often called the build table) and then
probes this hash table with the join key of the larger table (the probe
table) to find matching pairs. This technique is very efficient for *large
datasets* where *indexes are not present*, as it reduces the need for nested
loops.

+ Build phase:
  + Choose the smaller table (to minimize memory usage) as the "build
    table".
  + Create an in-memory hash table. For each record in the build table,
    compute a hash on the join key and insert the record into the hash table
    using this hash value as an index.
+ Probe phase:
  + Take each record from the larger table, often referred to as the "probe
    table".
  + Compute the hash on its join key (with the same hash function used in
    the build phase).
  + Use this hash value to look up in the hash table built from the smaller
    table.
  + If the bucket (determined by the hash) contains any entries, check each
    entry to see whether its join key actually matches the join key of the
    probe record (since hash functions can lead to collisions).
  + Output the joined rows.

=== Cost
- $h$ is a hash function mapping JoinAttrs values to ${0, 1, …, n_h}$, where
  JoinAttrs denotes the common attributes of $r$ and $s$ used in the natural
  join.
- $r_0, r_1, …, r_(n_h)$ denote partitions of $r$ tuples, each initially
  empty. Each tuple $t_r in r$ is put in partition $r_i$, where
  $i = h(t_r [#[JoinAttrs]])$.
- $s_0, s_1, …, s_(n_h)$ denote partitions of $s$ tuples, each initially
  empty. Each tuple $t_s in s$ is put in partition $s_i$, where
  $i = h(t_s [#[JoinAttrs]])$.

Cost of block transfers: $3(b_r + b_s) + 4 n_h$. The hash join requires
$2(⌈b_r∕b_b⌉ + ⌈b_s∕b_b⌉) + 2 n_h$ seeks, where $b_b$ blocks are allocated
for the input buffer and each output buffer.
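A sketch of the in-memory build/probe cycle on hypothetical relations, with
Python's built-in `hash` standing in for the hash function:

```python
from collections import defaultdict

# Hypothetical relations joined on r.b = s.b; s is smaller: build side.
r = [(1, "x"), (2, "y"), (3, "x")]   # probe side: (a, b)
s = [("x", 10), ("y", 20)]           # build side: (b, c)

# Build phase: bucket build-side tuples by hash of the join key.
buckets = defaultdict(list)
for ts in s:
    buckets[hash(ts[0])].append(ts)

# Probe phase: hash each probe tuple's key, scan only the matching bucket,
# and re-check the key to guard against hash collisions.
result = []
for tr in r:
    for ts in buckets.get(hash(tr[1]), []):
        if ts[0] == tr[1]:
            result.append((tr[0], tr[1], ts[1]))

print(result)  # [(1, 'x', 10), (2, 'y', 20), (3, 'x', 10)]
```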
= Relational-algebra

== Equivalence rules

- *Commutativity*: Union: $R∪S=S∪R$; Intersection: $R∩S=S∩R$; Join:
  $R join S=S join R$; Selection:
  $sigma_(p_1)(sigma_(p_2)(R))=sigma_(p_2)(sigma_(p_1)(R))$.
- *Associativity*: Union: $(R∪S)∪T=R∪(S∪T)$; Intersection:
  $(R∩S)∩T=R∩(S∩T)$; Join: $(R join S) join T=R join (S join T)$; Theta
  joins are associative in the following manner: $(E_1 join_(theta_1) E_2)
  join_(theta_2 and theta_3) E_3 ≡ E_1 join_(theta_1 and theta_3) (E_2
  join_(theta_2) E_3)$, where $theta_2$ involves attributes of $E_2$ and
  $E_3$ only.
- *Distributivity*: Union over Intersection: $R∪(S∩T)=(R∪S)∩(R∪T)$;
  Intersection over Union: $R∩(S∪T)=(R∩S)∪(R∩T)$; Join over Union:
  $R join (S∪T)=(R join S)∪(R join T)$; Selection over Union:
  $sigma_p(R∪S)=sigma_p(R)∪sigma_p(S)$; Projection over Union:
  $pi_c(R∪S)=pi_c(R)∪pi_c(S)$.
- Pushing selections through joins (selection and join commutativity):
  $sigma_p(R join S)=(sigma_p(R)) join S$ when $p$ involves only attributes
  of $R$.
- Pushing projections through joins: $pi_c(R join S)=pi_c(pi_(c sect
  #[attr])(R) join pi_(c sect #[attr])(S))$
== Operations
// FROM Database concepts

+ $σ_(θ_1 ∧ θ_2)(E) ≡ σ_(θ_1)(σ_(θ_2)(E))$
+ $σ_(θ_1)(σ_(θ_2)(E)) ≡ σ_(θ_2)(σ_(θ_1)(E))$
+ $Π_(L_1)(Π_(L_2)(… (Π_(L_n)(E)) …)) ≡ Π_(L_1)(E)$ -- only the outermost
  projection matters.
+ Selections can be combined with Cartesian products and theta joins:
  $σ_θ(E_1 × E_2) ≡ E_1 ⋈_θ E_2$ (this is just the definition of the theta
  join); $σ_(θ_1)(E_1 ⋈_(θ_2) E_2) ≡ E_1 ⋈_(θ_1 ∧ θ_2) E_2$
+ $E_1 ⋈_θ E_2 ≡ E_2 ⋈_θ E_1$
+ Join associativity: $(E_1 ⋈ E_2) ⋈ E_3 ≡ E_1 ⋈ (E_2 ⋈ E_3)$;
  $(E_1 join_(theta_1) E_2) join_(theta_2 and theta_3) E_3 ≡ E_1
  join_(theta_1 and theta_3) (E_2 join_(theta_2) E_3)$
+ Selection distribution: $σ_(θ_1)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ E_2$;
  $σ_(θ_1 ∧ θ_2)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ (σ_(θ_2)(E_2))$
+ Projection distribution: $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ (Π_(L_1)(E_1)) ⋈_θ
  (Π_(L_2)(E_2))$; $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ Π_(L_1 ∪ L_2)((Π_(L_1 ∪
  L_3)(E_1)) ⋈_θ (Π_(L_2 ∪ L_4)(E_2)))$
+ Union and intersection commutativity: $E_1 ∪ E_2 ≡ E_2 ∪ E_1$;
  $E_1 ∩ E_2 ≡ E_2 ∩ E_1$
+ Set union and intersection are associative: $(E_1 ∪ E_2) ∪ E_3 ≡ E_1 ∪
  (E_2 ∪ E_3)$; $(E_1 ∩ E_2) ∩ E_3 ≡ E_1 ∩ (E_2 ∩ E_3)$
+ The selection operation distributes over the union, intersection, and
  set-difference operations: $σ_θ(E_1 ∪ E_2) ≡ σ_θ(E_1) ∪ σ_θ(E_2)$;
  $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ σ_θ(E_2)$; $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) −
  σ_θ(E_2)$; also $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ E_2$ and $σ_θ(E_1 − E_2) ≡
  σ_θ(E_1) − E_2$
+ The projection operation distributes over the union operation:
  $Π_L(E_1 ∪ E_2) ≡ (Π_L(E_1)) ∪ (Π_L(E_2))$

- Projection ($pi$). Syntax: $pi_(#[attributes])(R)$. Purpose: reduces the
  relation to only the specified attributes. Example:
  $pi_(#[Name, Age])(#[Employees])$
- Selection ($sigma$). Syntax: $sigma_(#[condition])(R)$. Purpose: filters
  rows that meet the condition. Example: $sigma_(#[Age] > 30)(#[Employees])$
- Union ($union$). Syntax: $R union S$. Purpose: combines tuples from both
  relations, removing duplicates. Requirement: relations must be
  union-compatible.
- Intersection ($sect$). Syntax: $R sect S$. Purpose: retrieves tuples
  common to both relations. Requirement: relations must be union-compatible.
- Difference ($-$). Syntax: $R - S$. Purpose: retrieves tuples in $R$ that
  are not in $S$. Requirement: relations must be union-compatible.
- Cartesian Product ($times$). Syntax: $R times S$. Purpose: combines every
  tuple from $R$ with every tuple from $S$.
- Natural Join ($join$). Syntax: $R join S$. Purpose: combines tuples from
  $R$ and $S$ based on common attribute values.
- Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: combines
  tuples from $R$ and $S$ where the theta condition holds.
- Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$. Right
  Outer Join: $R join.r S$. Purpose: extends join to include non-matching
  tuples from one or both relations, filling with nulls.
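A quick sanity check of the push-selection rule on hypothetical relations
(a minimal evaluator, not a real algebra engine; the shared attribute `b`
and the predicate are assumptions of the sketch):

```python
# Hypothetical relations R(a, b) and S(b, c) as lists of dicts.
R = [{"a": 1, "b": "x"}, {"a": 5, "b": "y"}]
S = [{"b": "x", "c": 10}, {"b": "y", "c": 20}]

def natural_join(r, s):
    # Join on the shared attribute "b" (assumption for this sketch).
    return [{**tr, **ts} for tr in r for ts in s if tr["b"] == ts["b"]]

p = lambda t: t["a"] > 2   # predicate using only R's attributes

# σ_p(R ⋈ S)  vs  (σ_p(R)) ⋈ S
lhs = [t for t in natural_join(R, S) if p(t)]
rhs = natural_join([t for t in R if p(t)], S)
print(lhs == rhs)  # True
```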
= Concurrency

- *Read committed* allows only committed data to be read, but does not
  require repeatable reads.
- *Read uncommitted* allows uncommitted data to be read. It is the lowest
  isolation level allowed by SQL.

== Protocols

We say that a schedule S is *legal* under a given locking protocol if S is a
possible schedule for a set of transactions that follows the rules of the
locking protocol. We say that a locking protocol ensures conflict
serializability if and only if all legal schedules are *conflict
serializable*; in other words, for all legal schedules the associated
→ relation is acyclic.
=== Lock-based

==== Two-phase locking protocol

*The Two-Phase Locking (2PL) protocol* is a concurrency control method used
in database systems to ensure serializability of transactions. The protocol
involves two distinct phases. *Locking phase (growing phase):* a transaction
may acquire locks but cannot release any. During this phase, the transaction
continues to lock all the resources (data items) it needs to execute. \
*Unlocking phase (shrinking phase):* the transaction releases locks and
cannot acquire any new ones. Once a transaction starts releasing locks, it
stays in this phase until all locks are released.

==== Problems of locks

*Deadlock* is a condition where two or more tasks are each waiting for the
other to release a resource, or more than two tasks are waiting for
resources in a circular chain. \ *Starvation* (also known as indefinite
blocking) occurs when a process or thread is perpetually denied the
resources it needs to do its work. Unlike deadlock, where everything halts,
starvation affects only some tasks while others progress.
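The two-phase discipline can be enforced mechanically. A minimal sketch
(hypothetical class; no lock manager, blocking, or conflict handling):

```python
class TwoPhaseTxn:
    """Tracks one transaction's locks: all acquires before any release."""

    def __init__(self):
        self.held = set()
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock in shrinking phase")
        self.held.add(item)      # growing phase: acquire only

    def unlock(self, item):
        self.shrinking = True    # first release ends the growing phase
        self.held.discard(item)

t = TwoPhaseTxn()
t.lock("A")
t.lock("B")
t.unlock("A")                    # shrinking phase begins
try:
    t.lock("C")                  # violates 2PL
except RuntimeError as e:
    print(e)                     # 2PL violation: cannot lock in shrinking phase
```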
=== Timestamp-based

*Timestamp assignment:* each transaction is given a unique timestamp when it
starts. This timestamp determines the transaction's temporal order relative
to others. *Read rule:* a transaction can read an object if the last write
was performed by a transaction with an earlier or the same timestamp.
*Write rule:* a transaction can write to an object if the last read and the
last write were performed by transactions with earlier or the same
timestamps.
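The read/write rules for a single data item, as a minimal sketch
(hypothetical helper; a real scheduler also restarts rolled-back
transactions with new timestamps):

```python
class Item:
    """One data item under timestamp ordering."""

    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read the item
        self.write_ts = 0   # timestamp of the last write

    def read(self, ts):
        if ts < self.write_ts:      # read rule violated: item written by
            return "rollback"       # a younger transaction
        self.read_ts = max(self.read_ts, ts)
        return "ok"

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:  # write rule violated
            return "rollback"
        self.write_ts = ts
        return "ok"

x = Item()
print(x.write(5))   # ok
print(x.read(3))    # rollback
```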
=== Validation-based

Assumes that conflicts are rare and checks for them only at the end of a
transaction. *Working phase:* transactions execute without acquiring locks,
recording all data reads and writes. *Validation phase:* before committing,
each transaction must validate that no other transactions have modified the
data it accessed. *Commit phase:* if the validation is successful, the
transaction commits and applies its changes; if not, it rolls back and may
be restarted.

=== Version isolation

= Logs