diff --git a/img/bitmap.png b/img/bitmap.png
new file mode 100644
index 0000000..a2998ea
Binary files /dev/null and b/img/bitmap.png differ
diff --git a/main.typ b/main.typ
index edccfa4..c5b271f 100644
--- a/main.typ
+++ b/main.typ
@@ -1,17 +1,17 @@
 #set page(margin: (
-  top: 1cm,
-  bottom: 1cm,
-  right: 1cm,
-  left: 1cm,
+  top: 0.6cm,
+  bottom: 0.6cm,
+  right: 0.6cm,
+  left: 0.6cm,
 ))
-#set text(7pt)
+#set text(6.2pt)
 #show heading: it => {
   if it.level == 1 {
     // pagebreak(weak: true)
-    text(10pt, upper(it))
+    text(8.5pt, upper(it))
   } else if it.level == 2 {
-    text(9pt, smallcaps(it))
+    text(8pt, smallcaps(it))
   } else {
     text(8pt, smallcaps(it))
   }
@@ -22,90 +22,225 @@
 == Bitmap

+Each bit in a bitmap corresponds to a possible item or condition, with a bit
+set to 1 indicating presence or true, and a bit set to 0 indicating absence or
+false.
+
+#figure(
+  image("img/bitmap.png", width: 30%)
+)
+
 == B+ tree

+*B+ tree* is a self-balancing tree data structure that keeps data sorted and
+allows searches, sequential access, insertions, and deletions in logarithmic
+time. It is an extension of the B-tree and is used extensively in databases
+and filesystems for indexing. A B+ tree is *balanced*; its order $n$ is
+defined such that each node (except the root) can have at most $n$ children
+(pointers) and at least $⌈n/2⌉$ children; *internal nodes hold* between
+$⌈n/2⌉−1$ and $n−1$ keys; leaf nodes also hold between $⌈n/2⌉−1$ and $n−1$
+keys but additionally store the data values corresponding to the keys; *leaf
+nodes are linked* together, making range queries and sequential access very
+efficient.
+
+- *Insert (key, data)*:
+  - Insert the key into the appropriate leaf node in sorted order;
+  - If the node overflows (more than $n−1$ keys), split it, add the middle
+    key to the parent, and adjust pointers;
+    + Leaf split: keys $1$ to $ceil(frac(n,2))$ stay and keys
+      $ceil(frac(n,2)) + 1$ to $n$ go to a new leaf. The lowest key of the
+      new leaf is copied up to the parent.
+    + Internal-node split: keys $1$ to $ceil(frac(n+1,2)) - 1$ stay, keys
+      $ceil(frac(n+1,2)) + 1$ to $n$ go to a new node, and key
+      $ceil(frac(n+1,2))$ is moved up to the parent.
+  - If a split propagates to the root and causes the root to overflow, split
+    the root and create a new root. Note: the root may contain fewer than
+    $ceil(frac(n,2)) - 1$ keys.
+- *Delete (key)*:
+  - Remove the key from the leaf node.
+  - If the node underflows (fewer than $⌈n/2⌉−1$ keys), keys and pointers are
+    redistributed or nodes are merged to maintain minimum occupancy.
-
+  - Adjustments may propagate up to ensure all properties are maintained.
+
 == Hash-index

+*Hash indices* are a type of database index that uses a hash function to
+compute the location (hash value) of data items for quick retrieval. They are
+particularly efficient for equality searches (exact-match lookups).
+
+*Hash Function*: A hash function takes a key (a data item's attribute used for
+indexing) and converts it into a hash value. This hash value determines the
+position in the hash table where the corresponding record's pointer is stored.
+*Hash Table*: The hash table stores pointers to the actual data records in the
+database. Each entry in the hash table corresponds to a potential hash value
+generated by the hash function.
+
 = Algorithms

 == Nested-loop join

-=== Overview
+*Nested Loop Join*: A nested loop join is a database join operation where each
+tuple of the outer table is compared against every tuple of the inner table to
+find all pairs of tuples that satisfy the join condition. This method is
+simple but can be inefficient for large datasets due to its high computational
+cost.

-=== Cost
+```python
+# Simplified version (to get the idea)
+for tr in r:
+    for ts in s:
+        test_pair(tr, ts)
+```
+
+// TODO: Add seek information
+Block transfer cost (worst case): $n_r ∗ b_s + b_r$, where $b_r$ is the number
+of blocks and $n_r$ the number of tuples in relation $r$ (likewise for $s$).
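+
+A minimal runnable sketch of the same idea (illustrative only; the relations
+and the `id` attribute are made up):
+
+```python
+# Nested-loop equi-join over two in-memory relations (illustrative sketch).
+r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
+s = [{"id": 2, "b": "u"}, {"id": 3, "b": "v"}]
+
+result = []
+for tr in r:                           # scan the outer relation once
+    for ts in s:                       # scan the inner relation per outer tuple
+        if tr["id"] == ts["id"]:       # the join condition
+            result.append({**tr, **ts})
+
+print(result)  # [{'id': 2, 'a': 'y', 'b': 'u'}]
+```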

 == Block-nested join

-=== Overview
+*Block Nested Loop Join*: A block nested loop join is an optimized version of the
+nested loop join that reads and holds a block of rows from the outer table in
+memory and then loops through the inner table, reducing the number of disk
+accesses and improving performance over a standard nested loop join, especially
+when indices are not available.

-=== Cost
+
+```python
+# Simplified version (to get the idea)
+for Br in blocks(r):
+    for Bs in blocks(s):
+        for tr in Br:
+            for ts in Bs:
+                test_pair(tr, ts)
+```
+
+// TODO: Add seek information
+Block transfer cost (worst case): $b_r ∗ b_s + b_r$, where $b_r$ is the number
+of blocks in relation $r$ (likewise for $s$).

 == Merge join

-=== Overview
+*Merge Join*: A merge join is a database join operation where both the outer
+and inner tables are first sorted on the join key, and then merged together by
+sequentially scanning through both tables to find matching pairs. This method
+is highly efficient when the tables are *already sorted* or can be *sorted
+quickly*, since it minimizes random disk access. The merge join is cheap in
+block transfers: their number equals the sum of the number of blocks in both
+files, $b_r + b_s$.
+Assuming that $b_b$ buffer blocks are allocated to each relation, the number of
+disk seeks required is $⌈b_r∕b_b⌉ + ⌈b_s∕b_b⌉$.
+
++ Sort Both Tables: If not already sorted, the outer table and the inner table
+  are sorted based on the join keys.
++ Merge: Once both tables are sorted, the algorithm performs a merging
+  operation similar to that used in merge sort:
+  + Begin with the first record of each table.
+  + Compare the join keys of the current records from both tables.
+  + If the keys match, join the records and move to the next record in both tables.
+  + If the join key of the outer table is smaller, move to the next record in
+    the outer table.
+  + If the join key of the inner table is smaller, move to the next record in
+    the inner table.
+  + Continue this process until all records in either table have been examined.
++ Output the Joined Rows.
+
-=== Cost

 == Hash-join

-=== Overview
+*Hash Join*: A hash join is a database join operation that builds an in-memory
+hash table using the join key of the smaller relation, often called the build
+table, and then probes this hash table using the join key of the larger, or
+probe, table to find matching pairs. This technique is very efficient for
+*large datasets* where *indexes are not present*, as it reduces the need for
+nested loops.

-=== Cost
+- $h$ is a hash function mapping JoinAttrs values to ${0, 1, … , n_h}$, where
+  JoinAttrs denotes the common attributes of $r$ and $s$ used in the natural join.
+- $r_0$, $r_1$, … , $r_(n_h)$ denote partitions of $r$ tuples, each initially empty.
+  Each tuple $t_r in r$ is put in partition $r_i$, where $i = h(t_r [#[JoinAttrs]])$.
+- $s_0$, $s_1$, ..., $s_(n_h)$ denote partitions of $s$ tuples, each initially empty.
+  Each tuple $t_s in s$ is put in partition $s_i$, where $i = h(t_s [#[JoinAttrs]])$.
+
+Cost of block transfers: $3(b_r + b_s) + 4 n_h$. The hash join thus requires
+$2(⌈b_r∕b_b⌉+⌈b_s∕b_b⌉) + 2n_h$ seeks.
+
+$b_b$ blocks are allocated for the input buffer and each output buffer.
+
++ Build Phase:
+  + Choose the smaller table (to minimize memory usage) as the "build table."
+  + Create an in-memory hash table. For each record in the build table,
+    compute a hash on the join key and insert the record into the hash table
+    using this hash value as an index.
++ Probe Phase:
+  + Take each record from the larger table, which is often referred to as the
+    "probe table."
+  + Compute the hash on the join key (same hash function used in the build
+    phase).
+  + Use this hash value to look up in the hash table built from the smaller
+    table.
+  + If the bucket (determined by the hash) contains any entries, check each
+    entry to see if the join key actually matches the join key of the record
+    from the probe table (since hash functions can lead to collisions).
++ Output the Joined Rows.
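+
+A minimal in-memory illustration of the build/probe idea (illustrative only;
+it assumes list-of-dict relations joined on a made-up `id` attribute and skips
+the partitioning-to-disk step):
+
+```python
+# Illustrative hash join: build on the smaller relation, probe with the larger.
+from collections import defaultdict
+
+r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]                        # build side
+s = [{"id": 2, "b": "u"}, {"id": 2, "b": "w"}, {"id": 3, "b": "v"}]   # probe side
+
+# Build phase: hash every build-side tuple on the join key.
+buckets = defaultdict(list)
+for tr in r:
+    buckets[hash(tr["id"])].append(tr)
+
+# Probe phase: look up each probe tuple and re-check the key (collisions!).
+result = []
+for ts in s:
+    for tr in buckets.get(hash(ts["id"]), []):
+        if tr["id"] == ts["id"]:
+            result.append({**tr, **ts})
+
+print(result)  # the two rows with id = 2
+```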

 = Relational-algebra

 == Equivalence rules

-- *Commutativity*: $R∪S=S∪R$; Intersection: $R∩S=S∩R$; Join: $R join S=S
-  join R$; Selection : $ sigma p_1( sigma p_2(R))= sigma p_2( sigma p_1(R))$.
-- *Associativity*: $(R∪S)∪T=R∪(S∪T)$; Intersection: $(R∩S)∩T=R∩(S∩T)$;
-  Join: $(R join S) join T=R join (S join T)$; Theta joins are associative in
-  the following manner: $(E_1 join_theta_1 E_2) join_(theta_2 and theta_3)
-  E_3 ≡E_1 join_(theta_1 or theta_3) (E_2 join_theta_2 E_3)$
-- *Distributivity*: Distributivity of Union over Intersection:
-  $R∪(S∩T)=(R∪S)∩(R∪T)$; Intersection over Union: $R∩(S∪T)=(R∩S)∪(R∩T)$ Join over
-  Union: $R join (S∪T)=(R join S)∪(R join T)$; Selection Over Union:
-  $ sigma p(R∪S)= sigma p(R)∪ sigma p(S)$; Projection Over Union: $pi c(R∪S)=pi c(R)∪pi c(S)$;
-- Selection and Join Commutativity: $ sigma p(R join S)= sigma p(R) join S$ if
-  p involves only attributes of R
-- Pushing Selections Through Joins: $ sigma p(R join S)=( sigma p(R)) join S$
-  when p only involves attributes of R
-- Pushing Projections Through Joins: $pi c(R join S)=pi c(pi_(c sect #[attr])
-  (R) join pi_(c sect #[attr]) (S))$
-== Operations
+// FROM Database concepts
++ $σ_(θ_1∧θ_2)(E) ≡ σ_(θ_1)(σ_(θ_2)(E))$
++ $σ_(θ_1)(σ_(θ_2)(E)) ≡ σ_(θ_2)(σ_(θ_1)(E))$
++ $Π_(L_1)(Π_(L_2)(… (Π_(L_n)(E)) …)) ≡ Π_(L_1)(E)$ -- only the outermost
+  projection matters.
++ Selections can be combined with Cartesian products and theta joins:
+  $σ_θ(E_1 × E_2) ≡ E_1 ⋈_θ E_2$ -- this is just the definition of the theta
+  join; $σ_(θ_1)(E_1 ⋈_(θ_2) E_2) ≡ E_1 ⋈_(θ_1 ∧ θ_2) E_2$.
++ $E_1 ⋈_θ E_2 ≡ E_2 ⋈_θ E_1$
++ Join associativity: $(E_1 ⋈ E_2) ⋈ E_3 ≡ E_1 ⋈ (E_2 ⋈ E_3)$;
+  $(E_1 ⋈_(θ_1) E_2) ⋈_(θ_2 ∧ θ_3) E_3 ≡ E_1 ⋈_(θ_1 ∧ θ_3) (E_2 ⋈_(θ_2) E_3)$,
+  where $θ_2$ involves attributes of $E_2$ and $E_3$ only.
++ Selection distribution: $σ_(θ_1)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ E_2$ when
+  $θ_1$ involves only attributes of $E_1$;
+  $σ_(θ_1∧θ_2)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ (σ_(θ_2)(E_2))$.
++ Projection distribution: $Π_(L_1∪L_2)(E_1 ⋈_θ E_2) ≡ (Π_(L_1)(E_1)) ⋈_θ
+  (Π_(L_2)(E_2))$ if $θ$ involves only attributes in $L_1 ∪ L_2$; otherwise
+  $Π_(L_1∪L_2)(E_1 ⋈_θ E_2) ≡ Π_(L_1∪L_2)((Π_(L_1∪L_3)(E_1)) ⋈_θ (Π_(L_2∪L_4)(E_2)))$,
+  where $L_3$, $L_4$ are the attributes of $E_1$, $E_2$ that occur in $θ$ but
+  not in $L_1 ∪ L_2$.
++ Union and intersection commutativity: $E_1 ∪ E_2 ≡ E_2 ∪ E_1$;
+  $E_1 ∩ E_2 ≡ E_2 ∩ E_1$.
++ Set union and intersection are associative: $(E_1 ∪ E_2) ∪ E_3 ≡ E_1 ∪ (E_2 ∪ E_3)$;
+  $(E_1 ∩ E_2) ∩ E_3 ≡ E_1 ∩ (E_2 ∩ E_3)$.
++ The selection operation distributes over the union, intersection, and
+  set-difference operations: $σ_θ(E_1 ∪ E_2) ≡ σ_θ(E_1) ∪ σ_θ(E_2)$;
+  $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ σ_θ(E_2)$; $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) − σ_θ(E_2)$;
+  for intersection and difference it also suffices to push the selection into
+  the first argument only: $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ E_2$ and
+  $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) − E_2$.
++ The projection operation distributes over the union operation:
+  $Π_L(E_1 ∪ E_2) ≡ (Π_L(E_1)) ∪ (Π_L(E_2))$.

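+A quick, illustrative sanity check of the selection-distribution rule (toy
+data and attribute names are made up):
+
+```python
+# Filtering after the join equals filtering the left input first, when the
+# predicate only touches the left relation's attributes.
+e1 = [{"id": 1, "age": 20}, {"id": 2, "age": 35}]
+e2 = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]
+
+def join(a, b):  # natural join on "id"
+    return [{**x, **y} for x in a for y in b if x["id"] == y["id"]]
+
+pred = lambda t: t["age"] > 30       # predicate over e1's attributes only
+
+select_after = [t for t in join(e1, e2) if pred(t)]
+push_down = join([t for t in e1 if pred(t)], e2)
+print(select_after == push_down)  # True
+```
+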
-- Projection ($pi$). Syntax: $pi_{#[attributes]}(R)$. Purpose: Reduces the
-  relation to only contain specified attributes. Example: $pi_{#[Name,
-  Age}]}(#[Employees])$
-
-- Selection ($sigma$). Syntax: $sigma_{#[condition]}(R)$. Purpose: Filters rows
-  that meet the condition. Example: $sigma_{#[Age] > 30}(#[Employees])$
-
-- Union ($union$). Syntax: $R union S$. Purpose: Combines tuples from both
-  relations, removing duplicates. Requirement: Relations must be
-  union-compatible.
-
-- Intersection ($sect$). Syntax: $R sect S$. Purpose: Retrieves tuples common
-  to both relations. Requirement: Relations must be union-compatible.
-
-- Difference ($-$). Syntax: $R - S$. Purpose: Retrieves tuples in R that are
-  not in S. Requirement: Relations must be union-compatible.
-
-- Cartesian Product ($times$). Syntax: $R times S$. Purpose: Combines tuples
-  from R with every tuple from S.
-
-- Natural Join ($join$). Syntax: $R join S$. Purpose: Combines tuples from R
-  and S based on common attribute values.
-
-- Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: Combines tuples
-  from R and S where the theta condition holds.
-
-- Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$.
-  Right Outer Join: $R join.r S$. Purpose: Extends join to include non-matching
-  tuples from one or both relations, filling with nulls.
+// == Operations
+//
+// - Projection ($pi$). Syntax: $pi_{#[attributes]}(R)$. Purpose: Reduces the
+//   relation to only contain specified attributes. Example: $pi_{#[Name,
+//   Age}]}(#[Employees])$
+//
+// - Selection ($sigma$). Syntax: $sigma_{#[condition]}(R)$. Purpose: Filters rows
+//   that meet the condition. Example: $sigma_{#[Age] > 30}(#[Employees])$
+//
+// - Union ($union$). Syntax: $R union S$. Purpose: Combines tuples from both
+//   relations, removing duplicates. Requirement: Relations must be
+//   union-compatible.
+//
+// - Intersection ($sect$). Syntax: $R sect S$. Purpose: Retrieves tuples common
+//   to both relations. Requirement: Relations must be union-compatible.
+//
+// - Difference ($-$). Syntax: $R - S$. Purpose: Retrieves tuples in R that are
+//   not in S. Requirement: Relations must be union-compatible.
+//
+// - Cartesian Product ($times$). Syntax: $R times S$. Purpose: Combines tuples
+//   from R with every tuple from S.
+//
+// - Natural Join ($join$). Syntax: $R join S$. Purpose: Combines tuples from R
+//   and S based on common attribute values.
+//
+// - Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: Combines tuples
+//   from R and S where the theta condition holds.
+//
+// - Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$.
+//   Right Outer Join: $R join.r S$. Purpose: Extends join to include non-matching
+//   tuples from one or both relations, filling with nulls.

 = Concurrency

@@ -171,7 +306,9 @@ conflict serializable.
 - *Read committed* allows only committed data to be read, but does not require re-
 peatable reads.
 - *Read uncommitted* allows uncommitted data to be read. Lowest isolation level allowed by SQL.

-== Schedule
+
+
+== Protocols
 We say that a schedule S is *legal* under a given locking protocol if S is a possible
 schedule for a set of transactions that follows the rules of the locking protocol. We say
@@ -179,27 +316,47 @@ that a locking protocol ensures conflict serializability if and only if all lega
 are *conflict serializable*; in other words, for all legal schedules the associated
 →relation is acyclic.

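+As an illustration of that acyclicity test (the toy schedule below is made
+up), one can build the precedence graph from the conflicting operations and
+check it for cycles:
+
+```python
+# Conflict-serializability check via the precedence graph (illustrative).
+# A schedule is a list of (transaction, action, item); actions are "r" or "w".
+schedule = [(1, "r", "A"), (2, "w", "A"), (1, "w", "A")]
+
+edges = set()
+for i, (ti, ai, xi) in enumerate(schedule):
+    for tj, aj, xj in schedule[i + 1:]:
+        if ti != tj and xi == xj and "w" in (ai, aj):   # conflicting pair
+            edges.add((ti, tj))                         # Ti precedes Tj
+
+def has_cycle(nodes, edges):
+    # Depth-first search with three colours: 0 = new, 1 = on stack, 2 = done.
+    colour = {n: 0 for n in nodes}
+    def dfs(u):
+        colour[u] = 1
+        for a, b in edges:
+            if a == u and (colour[b] == 1 or (colour[b] == 0 and dfs(b))):
+                return True
+        colour[u] = 2
+        return False
+    return any(colour[n] == 0 and dfs(n) for n in nodes)
+
+nodes = {t for t, _, _ in schedule}
+print("conflict serializable:", not has_cycle(nodes, edges))  # False here
+```
+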
-== Protocols
-
 === Lock-based

-==== Dealock
+==== Two-phase locking protocol
+
+*The Two-Phase Locking (2PL)* protocol is a concurrency control method used in
+database systems to ensure serializability of transactions. The protocol
+involves two distinct phases: *Locking Phase (Growing Phase):* a transaction
+may acquire locks but cannot release any locks. During this phase, the
+transaction keeps locking all the resources (data items) it needs in order to
+execute. \ *Unlocking Phase (Shrinking Phase):* the transaction releases locks
+and cannot acquire any new ones. Once a transaction starts releasing locks, it
+stays in this phase until all locks are released.
+
+==== Problems of locks

 *Deadlock* is a condition where two or more tasks are each waiting for the
 other to release a resource, or more than two tasks are waiting for resources
-in a circular chain.
-
-==== Starvation
-
-*Starvation* (also known as indefinite blocking) occurs when a process or
-thread is perpetually denied necessary resources to process its work. Unlike
-deadlock, where everything halts, starvation only affects some while others
-progress.
+in a circular chain.
+\ *Starvation* (also known as indefinite blocking) occurs
+when a process or thread is perpetually denied the resources it needs to do
+its work. Unlike deadlock, where everything halts, starvation only affects some
+transactions while others progress.

 === Timestamp-based

+*Timestamp Assignment:* Each transaction is given a unique timestamp when it
+starts. This timestamp determines the transaction's temporal order relative to
+others. *Read Rule:* A transaction can read an object if the last write to it
+was made by a transaction with an earlier or the same timestamp. *Write Rule:*
+A transaction can write to an object if the last read and the last write of it
+were made by transactions with earlier or the same timestamps.
+
 === Validation-based

+Assumes that conflicts are rare and checks for them only at the end of a transaction.
+*Working Phase:* Transactions execute without acquiring locks, recording all
+data reads and writes. *Validation Phase:* Before committing, each transaction
+must validate that no other transactions have modified the data it accessed.
+*Commit Phase:* If the validation is successful, the transaction commits and
+applies its changes. If not, it rolls back and may be restarted.
+
 === Version isolation

 = Logs