mirror of https://github.com/kristoferssolo/Databases-II-Cheatsheet.git
synced 2025-10-21 18:20:35 +00:00

v2

This commit is contained in:
parent d7d61184ea
commit c0b7597c98
BIN img/bitmap.png — new file (42 KiB; binary file not shown)

main.typ (303 lines)
@@ -1,17 +1,17 @@
 #set page(margin: (
-  top: 1cm,
-  bottom: 1cm,
-  right: 1cm,
-  left: 1cm,
+  top: 0.6cm,
+  bottom: 0.6cm,
+  right: 0.6cm,
+  left: 0.6cm,
 ))

-#set text(7pt)
+#set text(6.2pt)
 #show heading: it => {
   if it.level == 1 {
     // pagebreak(weak: true)
-    text(10pt, upper(it))
+    text(8.5pt, upper(it))
   } else if it.level == 2 {
-    text(9pt, smallcaps(it))
+    text(8pt, smallcaps(it))
   } else {
     text(8pt, smallcaps(it))
   }
@@ -22,90 +22,225 @@

== Bitmap

Each bit in a bitmap corresponds to a possible item or condition, with a bit set to 1 indicating presence (true) and a bit set to 0 indicating absence (false).

#figure(
  image("img/bitmap.png", width: 30%)
)
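As an illustrative sketch (not from the cheatsheet): one bitmap per distinct value can be held as a Python integer with one bit per row; the row values and helper names below are made up.

```python
# Hypothetical example: one bitmap per distinct value, one bit per row.
rows = ["red", "blue", "red", "green", "red"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)  # set bit i to 1

# Bit i = 1 means "row i has this value"; AND/OR combine predicates cheaply.
red_rows = [i for i in range(len(rows)) if (bitmaps["red"] >> i) & 1]
# red_rows == [0, 2, 4]
```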

== B+ tree

*B+ tree* is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. It is an extension of the B-tree and is extensively used in databases and filesystems for indexing. A B+ tree is *balanced*; *order (n)*: defined such that each node (except the root) can have at most $n$ children (pointers) and at least $ceil(n/2)$ children; *internal nodes hold* between $ceil(n/2) - 1$ and $n - 1$ keys; leaf nodes also hold between $ceil(n/2) - 1$ and $n - 1$ keys but additionally store the data values corresponding to their keys; *leaf nodes are linked* together, making range queries and sequential access very efficient.

- *Insert (key, data)*:
  - Insert the key into the appropriate leaf node in sorted order;
  - If the node overflows (more than $n - 1$ keys), split it, add the middle key to the parent, and adjust pointers;
    + Leaf split: keys $1$ to $ceil(n/2)$ and $ceil(n/2) + 1$ to $n$ form two leaves. Promote (copy) the lowest key of the second leaf into the parent.
    + Internal-node split: keys $1$ to $ceil((n+1)/2) - 1$ and $ceil((n+1)/2) + 1$ to $n$ form two nodes; key $ceil((n+1)/2)$ is moved up into the parent.
  - If a split propagates to the root and causes it to overflow, split the root and create a new root. Note: the root may contain fewer than $ceil(n/2) - 1$ keys.
- *Delete (key)*:
  - Remove the key from the leaf node.
  - If the node underflows (fewer than $ceil(n/2) - 1$ keys), keys and pointers are redistributed or nodes are merged to maintain minimum occupancy.
  - Adjustments may propagate up to ensure all properties are maintained.

== Hash-index

*Hash indices* are a type of database index that uses a hash function to compute the location (hash value) of data items for quick retrieval. They are particularly efficient for equality searches that match exact values.

*Hash function*: takes a key (a data item's attribute used for indexing) and converts it into a hash value. This hash value determines the position in the hash table where the corresponding record's pointer is stored.
*Hash table*: stores pointers to the actual data records in the database. Each entry in the hash table corresponds to a potential hash value generated by the hash function.
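A minimal sketch of the idea, with a toy record list and an arbitrary bucket count (all names invented); Python's built-in `hash` stands in for the hash function, and stored row positions stand in for record pointers.

```python
# Toy hash index: a bucket array of (key, record position) pairs.
N_BUCKETS = 4
records = [(101, "Ann"), (205, "Bo"), (309, "Cy")]  # (id, name) rows

buckets = [[] for _ in range(N_BUCKETS)]
for pos, (rid, _name) in enumerate(records):
    buckets[hash(rid) % N_BUCKETS].append((rid, pos))  # store a "pointer"

def lookup(rid):
    """Equality search: hash once, then scan only one bucket."""
    for key, pos in buckets[hash(rid) % N_BUCKETS]:
        if key == rid:
            return records[pos]
    return None
```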

= Algorithms

== Nested-loop join

=== Overview
*Nested Loop Join*: a database join operation where each tuple of the outer table is compared against every tuple of the inner table to find all pairs of tuples that satisfy the join condition. This method is simple but can be inefficient for large datasets due to its high computational cost.

=== Cost
```python
# Simplified version (to get the idea)
for each tuple tr in r:
    for each tuple ts in s:
        test pair (tr, ts)
```

// TODO: Add seek information
Block transfer cost: $n_r * b_s + b_r$ block transfers would be required, where $n_r$ is the number of tuples in relation $r$ and $b_r$ is the number of blocks in relation $r$; likewise for $s$.
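The pseudocode above, turned into runnable Python (the sample relations and join condition are made up):

```python
def nested_loop_join(r, s, cond):
    """Compare every tuple tr of outer r with every tuple ts of inner s."""
    return [(tr, ts) for tr in r for ts in s if cond(tr, ts)]

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (3, "y")]
pairs = nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0])  # equi-join on id
```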

== Block-nested join

=== Overview
*Block Nested Loop Join*: an optimized version of the nested loop join that reads and holds a block of rows from the outer table in memory and then loops through the inner table, reducing the number of disk accesses and improving performance over a standard nested loop join, especially when indices are not available.

=== Cost

```python
# Simplified version (to get the idea)
for each block Br of r:
    for each block Bs of s:
        for each tuple tr in Br:
            for each tuple ts in Bs:
                test pair (tr, ts)
```

// TODO: Add seek information
Block transfer cost: $b_r * b_s + b_r$, where $b_r$ is the number of blocks in relation $r$; likewise for $s$.
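A runnable sketch of the same idea; here a "block" is a fixed-size slice of the outer list, and the block size is arbitrary. The returned scan count shows that the inner relation is traversed once per outer *block* rather than once per outer *tuple*.

```python
def block_nested_loop_join(r, s, cond, block_size=2):
    """Hold one block of r in memory; scan s once per block, not once per tuple."""
    out, s_scans = [], 0
    for start in range(0, len(r), block_size):
        block = r[start:start + block_size]  # one block of the outer relation
        s_scans += 1                         # a single pass over s per block
        for ts in s:
            for tr in block:
                if cond(tr, ts):
                    out.append((tr, ts))
    return out, s_scans

r = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
s = [(2, "x"), (4, "y")]
pairs, scans = block_nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0])
```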

== Merge join

=== Overview
*Merge Join*: a database join operation where both the outer and inner tables are first sorted on the join key and then merged by sequentially scanning through both tables to find matching pairs. This method is highly efficient when the tables are *already sorted* or can be *sorted quickly*, and it minimizes random disk access. The merge-join method is efficient; the number of block transfers is equal to the sum of the number of blocks in both files, $b_r + b_s$.
Assuming that $b_b$ buffer blocks are allocated to each relation, $ceil(b_r / b_b) + ceil(b_s / b_b)$ disk seeks are required.

+ Sort both tables: if not already sorted, the outer and inner tables are sorted on the join keys.
+ Merge: once both tables are sorted, the algorithm performs a merging operation similar to that used in merge sort:
  + Begin with the first record of each table.
  + Compare the join keys of the current records from both tables.
  + If the keys match, join the records and move to the next record in both tables.
  + If the join key of the outer table is smaller, move to the next record in the outer table.
  + If the join key of the inner table is smaller, move to the next record in the inner table.
  + Continue this process until all records in either table have been examined.
+ Output the joined rows.
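The merge steps above can be sketched in Python, under the simplifying assumption that join keys are unique within each input (duplicate keys need extra handling):

```python
def merge_join(r, s, key=lambda t: t[0]):
    """Both inputs must already be sorted on the join key; keys assumed unique."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        kr, ks = key(r[i]), key(s[j])
        if kr == ks:                 # match: emit the pair, advance both sides
            out.append((r[i], s[j]))
            i += 1
            j += 1
        elif kr < ks:                # outer key smaller: advance the outer table
            i += 1
        else:                        # inner key smaller: advance the inner table
            j += 1
    return out

pairs = merge_join([(1, "a"), (2, "b"), (4, "d")], [(2, "x"), (3, "y"), (4, "z")])
```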

=== Cost

== Hash-join

=== Overview
*Hash Join*: a database join operation that builds an in-memory hash table on the join key of the smaller table, often called the build table, and then probes this hash table using the join key of the larger table, the probe table, to find matching pairs. This technique is very efficient for *large datasets* where *indexes are not present*, as it avoids the need for nested loops.

=== Cost
- $h$ is a hash function mapping JoinAttrs values to ${0, 1, … , n_h}$, where JoinAttrs denotes the common attributes of $r$ and $s$ used in the natural join.
- $r_0$, $r_1$, … , $r_(n_h)$ denote partitions of $r$ tuples, each initially empty. Each tuple $t_r in r$ is put in partition $r_i$, where $i = h(t_r [#[JoinAttrs]])$.
- $s_0$, $s_1$, … , $s_(n_h)$ denote partitions of $s$ tuples, each initially empty. Each tuple $t_s in s$ is put in partition $s_i$, where $i = h(t_s [#[JoinAttrs]])$.

Cost of block transfers: $3(b_r + b_s) + 4 n_h$. The hash join thus requires $2(ceil(b_r / b_b) + ceil(b_s / b_b)) + 2 n_h$ seeks.

$b_b$ blocks are allocated for the input buffer and each output buffer.

+ Build phase:
  + Choose the smaller table (to minimize memory usage) as the "build table."
  + Create an in-memory hash table. For each record in the build table, compute a hash on the join key and insert the record into the hash table using this hash value as an index.
+ Probe phase:
  + Take each record from the larger table, often referred to as the "probe table."
  + Compute the hash on the join key (using the same hash function as in the build phase).
  + Use this hash value to look up the hash table built from the smaller table.
  + If the bucket (determined by the hash) contains any entries, check each entry to see whether its join key actually matches the join key of the record from the probe table (since hash functions can lead to collisions).
+ Output the joined rows.
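The build and probe phases above, as a short Python sketch (sample relations are made up; a dict of lists stands in for the hash table, and its bucket lists play the role of collision handling):

```python
def hash_join(build, probe, key=lambda t: t[0]):
    table = {}
    for t in build:                          # build phase: hash the smaller input
        table.setdefault(key(t), []).append(t)
    out = []
    for t in probe:                          # probe phase: same hash function
        for match in table.get(key(t), []):  # scan the bucket for real matches
            out.append((match, t))
    return out

pairs = hash_join([(1, "a"), (2, "b")], [(2, "x"), (2, "y"), (3, "z")])
```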

= Relational-algebra

== Equivalence rules

- *Commutativity*: Union: $R ∪ S = S ∪ R$; Intersection: $R ∩ S = S ∩ R$; Join: $R join S = S join R$; Selection: $sigma_(p_1)(sigma_(p_2)(R)) = sigma_(p_2)(sigma_(p_1)(R))$.
- *Associativity*: Union: $(R ∪ S) ∪ T = R ∪ (S ∪ T)$; Intersection: $(R ∩ S) ∩ T = R ∩ (S ∩ T)$; Join: $(R join S) join T = R join (S join T)$; Theta joins are associative in the following manner: $(E_1 join_(theta_1) E_2) join_(theta_2 and theta_3) E_3 ≡ E_1 join_(theta_1 and theta_3) (E_2 join_(theta_2) E_3)$, where $theta_2$ involves attributes of only $E_2$ and $E_3$.
- *Distributivity*: Union over Intersection: $R ∪ (S ∩ T) = (R ∪ S) ∩ (R ∪ T)$; Intersection over Union: $R ∩ (S ∪ T) = (R ∩ S) ∪ (R ∩ T)$; Join over Union: $R join (S ∪ T) = (R join S) ∪ (R join T)$; Selection over Union: $sigma_p(R ∪ S) = sigma_p(R) ∪ sigma_p(S)$; Projection over Union: $pi_c(R ∪ S) = pi_c(R) ∪ pi_c(S)$.
- Selection and join commutativity: $sigma_p(R join S) = sigma_p(R) join S$ if $p$ involves only attributes of $R$.
- Pushing selections through joins: $sigma_p(R join S) = (sigma_p(R)) join S$ when $p$ involves only attributes of $R$.
- Pushing projections through joins: $pi_c(R join S) = pi_c(pi_(c sect #[attr])(R) join pi_(c sect #[attr])(S))$.

== Operations
// FROM Database concepts
+ $σ_(θ_1 ∧ θ_2)(E) ≡ σ_(θ_1)(σ_(θ_2)(E))$
+ $σ_(θ_1)(σ_(θ_2)(E)) ≡ σ_(θ_2)(σ_(θ_1)(E))$
+ $Π_(L_1)(Π_(L_2)(… (Π_(L_n)(E)) …)) ≡ Π_(L_1)(E)$ -- only the outermost projection matters.
+ Selections can be combined with Cartesian products and theta joins: $σ_θ(E_1 × E_2) ≡ E_1 ⋈_θ E_2$ -- this expression is just the definition of the theta join; $σ_(θ_1)(E_1 ⋈_(θ_2) E_2) ≡ E_1 ⋈_(θ_1 ∧ θ_2) E_2$
+ $E_1 ⋈_θ E_2 ≡ E_2 ⋈_θ E_1$
+ Join associativity: $(E_1 ⋈ E_2) ⋈ E_3 ≡ E_1 ⋈ (E_2 ⋈ E_3)$; $(E_1 join_(theta_1) E_2) join_(theta_2 and theta_3) E_3 ≡ E_1 join_(theta_1 and theta_3) (E_2 join_(theta_2) E_3)$
+ Selection distribution: $σ_(θ_1)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ E_2$; $σ_(θ_1 ∧ θ_2)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ (σ_(θ_2)(E_2))$
+ Projection distribution: $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ (Π_(L_1)(E_1)) ⋈_θ (Π_(L_2)(E_2))$; $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ Π_(L_1 ∪ L_2)((Π_(L_1 ∪ L_3)(E_1)) ⋈_θ (Π_(L_2 ∪ L_4)(E_2)))$
+ Union and intersection commutativity: $E_1 ∪ E_2 ≡ E_2 ∪ E_1$; $E_1 ∩ E_2 ≡ E_2 ∩ E_1$
+ Set union and intersection are associative: $(E_1 ∪ E_2) ∪ E_3 ≡ E_1 ∪ (E_2 ∪ E_3)$; $(E_1 ∩ E_2) ∩ E_3 ≡ E_1 ∩ (E_2 ∩ E_3)$
+ The selection operation distributes over the union, intersection, and set-difference operations: $σ_θ(E_1 ∪ E_2) ≡ σ_θ(E_1) ∪ σ_θ(E_2)$; $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ σ_θ(E_2)$; $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) − σ_θ(E_2)$; also $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ E_2$ and $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) − E_2$
+ The projection operation distributes over the union operation: $Π_L(E_1 ∪ E_2) ≡ (Π_L(E_1)) ∪ (Π_L(E_2))$

- Projection ($pi$). Syntax: $pi_{#[attributes]}(R)$. Purpose: Reduces the relation to only contain specified attributes. Example: $pi_{#[Name, Age]}(#[Employees])$

- Selection ($sigma$). Syntax: $sigma_{#[condition]}(R)$. Purpose: Filters rows that meet the condition. Example: $sigma_{#[Age] > 30}(#[Employees])$

- Union ($union$). Syntax: $R union S$. Purpose: Combines tuples from both relations, removing duplicates. Requirement: Relations must be union-compatible.

- Intersection ($sect$). Syntax: $R sect S$. Purpose: Retrieves tuples common to both relations. Requirement: Relations must be union-compatible.

- Difference ($-$). Syntax: $R - S$. Purpose: Retrieves tuples in R that are not in S. Requirement: Relations must be union-compatible.

- Cartesian Product ($times$). Syntax: $R times S$. Purpose: Combines each tuple from R with every tuple from S.

- Natural Join ($join$). Syntax: $R join S$. Purpose: Combines tuples from R and S based on common attribute values.

- Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: Combines tuples from R and S where the theta condition holds.

- Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$. Right Outer Join: $R join.r S$. Purpose: Extends join to include non-matching tuples from one or both relations, filling with nulls.
= Concurrency

@@ -171,7 +306,9 @@ conflict serializable.
- *Read committed* allows only committed data to be read, but does not require repeatable reads.
- *Read uncommitted* allows uncommitted data to be read. This is the lowest isolation level allowed by SQL.

== Schedule

== Protocols

We say that a schedule S is *legal* under a given locking protocol if S is a possible schedule for a set of transactions that follows the rules of the locking protocol. We say that a locking protocol ensures conflict serializability if and only if all legal schedules are *conflict serializable*; in other words, for all legal schedules the associated → relation is acyclic.

=== Lock-based

==== Two-phase lock protocol

*The Two-Phase Locking (2PL)* protocol is a concurrency control method used in database systems to ensure serializability of transactions. The protocol involves two distinct phases: *Locking phase (growing phase):* a transaction may acquire locks but cannot release any locks. During this phase, the transaction continues to lock all the resources (data items) it needs to execute. \ *Unlocking phase (shrinking phase):* the transaction releases locks and cannot acquire any new ones. Once a transaction starts releasing locks, it remains in this phase until all locks are released.
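A toy illustration of the two phases (the class and error message are invented for this sketch): the first unlock flips the transaction into the shrinking phase, after which any further lock request is rejected.

```python
class TwoPhaseTxn:
    """2PL sketch: growing phase until the first unlock, then shrinking only."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after an unlock")
        self.held.add(item)          # growing phase: acquire freely

    def unlock(self, item):
        self.shrinking = True        # first release enters the shrinking phase
        self.held.discard(item)
```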

==== Problems of locks

*Deadlock* is a condition where two or more tasks are each waiting for the other to release a resource, or more than two tasks are waiting for resources in a circular chain.

==== Starvation

*Starvation* (also known as indefinite blocking) occurs when a process or thread is perpetually denied the resources it needs to do its work. Unlike deadlock, where everything halts, starvation affects only some tasks while others progress.

=== Timestamp-based

*Timestamp assignment:* each transaction is given a unique timestamp when it starts. This timestamp determines the transaction's temporal order relative to others. *Read rule:* a transaction can read an object if the last write occurred by a transaction with an earlier or the same timestamp. *Write rule:* a transaction can write to an object if the last read and the last write occurred by transactions with earlier or the same timestamps.
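A minimal sketch of the two rules, assuming each data item tracks the largest timestamp that read it (`rts`) and the timestamp of its last write (`wts`); the class and function names are invented:

```python
class Item:
    def __init__(self):
        self.rts = 0  # largest timestamp of any transaction that read the item
        self.wts = 0  # timestamp of the transaction that last wrote the item

def read(item, ts):
    """Read rule: rejected if a younger transaction already wrote the item."""
    if ts < item.wts:
        return False              # reader is too old: abort/restart it
    item.rts = max(item.rts, ts)
    return True

def write(item, ts):
    """Write rule: rejected if a younger transaction already read or wrote it."""
    if ts < item.rts or ts < item.wts:
        return False
    item.wts = ts
    return True
```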

=== Validation-based

Assumes that conflicts are rare and checks for them only at the end of a transaction. *Working phase:* transactions execute without acquiring locks, recording all data reads and writes. *Validation phase:* before committing, each transaction must validate that no other transactions have modified the data it accessed. *Commit phase:* if the validation is successful, the transaction commits and applies its changes; if not, it rolls back and may be restarted.

=== Version isolation

= Logs