This commit is contained in:
jorenchik 2024-05-05 17:27:42 +03:00
parent d7d61184ea
commit c0b7597c98
2 changed files with 230 additions and 73 deletions

BIN img/bitmap.png (new file, 42 KiB)

main.typ (303 changed lines)
@@ -1,17 +1,17 @@
 #set page(margin: (
-  top: 1cm,
-  bottom: 1cm,
-  right: 1cm,
-  left: 1cm,
+  top: 0.6cm,
+  bottom: 0.6cm,
+  right: 0.6cm,
+  left: 0.6cm,
 ))
-#set text(7pt)
+#set text(6.2pt)
 #show heading: it => {
   if it.level == 1 {
     // pagebreak(weak: true)
-    text(10pt, upper(it))
+    text(8.5pt, upper(it))
   } else if it.level == 2 {
-    text(9pt, smallcaps(it))
+    text(8pt, smallcaps(it))
   } else {
     text(8pt, smallcaps(it))
   }
@@ -22,90 +22,225 @@
== Bitmap
Each bit in a bitmap corresponds to a possible item or condition, with a bit
set to 1 indicating presence or true, and a bit set to 0 indicating absence or
false.
#figure(
image("img/bitmap.png", width: 30%)
)
== B+ tree
*B+ tree* is a type of self-balancing tree data structure that keeps data
sorted and allows searches, sequential access, insertions, and deletions in
logarithmic time. It is an extension of the B-tree and is used extensively in
databases and filesystems for indexing. A B+ tree is *balanced*; *Order (n)*:
defined such that each node (except the root) can have at most $n$ children
(pointers) and at least $ceil(n/2)$ children; *Internal nodes hold* between
$ceil(n/2) - 1$ and $n - 1$ keys; leaf nodes also hold between $ceil(n/2) - 1$
and $n - 1$ keys but additionally store the data values corresponding to the
keys; *Leaf Nodes Linked*: leaf nodes are linked together, making range
queries and sequential access very efficient.
- *Insert (key, data)*:
  - Insert the key into the appropriate leaf node in sorted order;
  - If the node overflows (more than $n - 1$ keys), split it, add the middle
    key to the parent, and adjust pointers;
    + Leaf split: keys $1$ to $ceil(n/2)$ and $ceil(n/2) + 1$ to $n$ become
      two leaves. A copy of the lowest key of the second leaf is promoted.
    + Node split: keys $1$ to $ceil((n+1)/2) - 1$ stay, keys
      $ceil((n+1)/2) + 1$ to $n$ form the new node, and key $ceil((n+1)/2)$
      is moved up.
  - If a split propagates to the root and causes the root to overflow, split
    the root and create a new root. Note: the root may contain fewer than
    $ceil(n/2) - 1$ keys.
- *Delete (key)*:
  - Remove the key from the leaf node.
  - If the node underflows (fewer than $ceil(n/2) - 1$ keys), keys and
    pointers are redistributed or nodes are merged to maintain minimum
    occupancy.
  - Adjustments may propagate up to ensure all properties are maintained.
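The leaf-split arithmetic above can be sketched as follows (a minimal sketch, assuming keys are kept in a sorted Python list and `n` is the number of keys in the overflowing leaf):

```python
import math

def split_leaf(keys, n):
    """Split an overflowing leaf holding n keys (the maximum is n - 1).

    The first ceil(n/2) keys stay in the old leaf, the rest form a new
    leaf; a copy of the new leaf's lowest key is promoted to the parent.
    """
    mid = math.ceil(n / 2)
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]  # right[0] is the promoted (copied) key

left, right, promoted = split_leaf([5, 10, 15, 20], n=4)
# left == [5, 10], right == [15, 20], promoted == 15
```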
== Hash-index
*Hash indices* are a type of database index that uses a hash function to
compute the location (hash value) of data items for quick retrieval. They are
particularly efficient for equality searches that match exact values.
*Hash Function*: A hash function takes a key (a data item's attribute used for
indexing) and converts it into a hash value. This hash value determines the
position in the hash table where the corresponding record's pointer is stored.
*Hash Table*: The hash table stores pointers to the actual data records in the
database. Each entry in the hash table corresponds to a potential hash value
generated by the hash function.
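A minimal sketch of the idea, using an in-memory dict of buckets in place of a disk-resident hash table (names and sizes are illustrative; Python's built-in `hash` stands in for the hash function):

```python
from collections import defaultdict

N_BUCKETS = 8

def h(key) -> int:
    """Hash function: key -> bucket number."""
    return hash(key) % N_BUCKETS

index = defaultdict(list)          # bucket -> list of record ids (pointers)
records = {1: ("alice", 30), 2: ("bob", 25), 3: ("alice", 41)}

for rid, (name, _age) in records.items():
    index[h(name)].append(rid)     # index on the `name` attribute

# Equality search: hash the key, then verify each entry (collisions possible).
hits = [rid for rid in index[h("alice")] if records[rid][0] == "alice"]
assert hits == [1, 3]
```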
= Algorithms
== Nested-loop join
=== Overview
*Nested Loop Join*: A nested loop join is a database join operation where each
tuple of the outer table is compared against every tuple of the inner table to
find all pairs of tuples which satisfy the join condition. This method is
simple but can be inefficient for large datasets due to its high computational
cost.
=== Cost
```python
# Simplified version (to get the idea)
for tr in r:
    for ts in s:
        test(tr, ts)  # check whether the pair satisfies the join condition
```
Block transfer cost: $n_r b_s + b_r$ block transfers are required, where
$n_r$ is the number of tuples and $b_r$ the number of blocks in relation $r$
(similarly for $s$). In the worst case the buffer holds only one block of
each relation, and $n_r + b_r$ seeks are required.
== Block-nested join
=== Overview
*Block Nested Loop Join*: A block nested loop join is an optimized version of the
nested loop join that reads and holds a block of rows from the outer table in
memory and then loops through the inner table, reducing the number of disk
accesses and improving performance over a standard nested loop join, especially
when indices are not available.
=== Cost
```python
# Simplified version (to get the idea)
for Br in blocks(r):
    for Bs in blocks(s):
        for tr in Br:
            for ts in Bs:
                test(tr, ts)  # check whether the pair satisfies the join condition
```
Block transfer cost: $b_r b_s + b_r$, where $b_r$ is the number of blocks in
relation $r$ (similarly for $s$); in the worst case $2 b_r$ seeks are
required.
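Plugging hypothetical sizes into the two transfer-cost formulas illustrates why the block variant wins (all numbers are made up):

```python
n_r, b_r = 10_000, 400   # tuples and blocks of the outer relation r
b_s = 100                # blocks of the inner relation s

nested_loop  = n_r * b_s + b_r   # tuple-at-a-time outer loop
block_nested = b_r * b_s + b_r   # block-at-a-time outer loop

print(nested_loop, block_nested)  # 1000400 40400
```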
== Merge join
=== Overview
*Merge Join*: A merge join is a database join operation where both the outer
and inner tables are first sorted on the join key, and then merged together by
sequentially scanning through both tables to find matching pairs. This method
is highly efficient when the tables are *already sorted* or can be *sorted
quickly*, minimizes random disk access. Merge-join method is efficient; the
number of block transfers is equal to the sum of the number of blocks in both
files, $b_r + b_s$.
Assuming that $bb$ buffer blocks are allocated to each relation, the number of disk
seeks required would be $⌈b_rb_b⌉+ ⌈b_sb_b⌉$ disk seeks
+ Sort Both Tables: If not already sorted, the outer table and the inner table
are sorted based on the join keys.
+ Merge: Once both tables are sorted, the algorithm performs a merging
operation similar to that used in merge sort:
+ Begin with the first record of each table.
+ Compare the join keys of the current records from both tables.
+ If the keys match, join the records and move to the next record in both tables.
+ If the join key of the outer table is smaller, move to the next record in
the outer table.
+ If the join key of the inner table is smaller, move to the next record in
the inner table.
+ Continue this process until all records in either table have been examined.
+ Output the Joined Rows;
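The merge step above can be sketched as follows (a minimal sketch, assuming both inputs are already-sorted lists of `(key, row)` pairs with unique keys; names are illustrative):

```python
def merge_join(r, s):
    """Merge two sorted (key, row) lists, emitting joined row pairs."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        kr, ks = r[i][0], s[j][0]
        if kr == ks:
            out.append((r[i][1], s[j][1]))  # keys match: join and advance both
            i += 1
            j += 1
        elif kr < ks:
            i += 1   # advance the side with the smaller key
        else:
            j += 1
    return out

pairs = merge_join([(1, "a"), (3, "b"), (5, "c")],
                   [(3, "x"), (4, "y"), (5, "z")])
# pairs == [("b", "x"), ("c", "z")]
```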
=== Cost
== Hash-join
=== Overview
*Hash Join*: A hash join is a database join operation that builds an in-memory
hash table using the join key from the smaller, often called the build table,
and then probes this hash table using the join key from the larger, or probe
table, to find matching pairs. This technique is very efficient for *large
datasets* where *indexes are not present*, as it reduces the need for nested
loops.
=== Cost
- $h$ is a hash function mapping JoinAttrs values to ${0, 1, …, n_h}$, where
JoinAttrs denotes the common attributes of $r$ and $s$ used in the natural join.
- $r_0$, $r_1$, …, $r_(n_h)$ denote partitions of $r$ tuples, each initially empty.
Each tuple $t_r in r$ is put in partition $r_i$, where $i = h(t_r [#[JoinAttrs]])$.
- $s_0$, $s_1$, …, $s_(n_h)$ denote partitions of $s$ tuples, each initially empty.
Each tuple $t_s in s$ is put in partition $s_i$, where $i = h(t_s [#[JoinAttrs]])$.
Cost of block transfers: $3(b_r + b_s) + 4 n_h$. The hash join thus requires
$2(ceil(b_r / b_b) + ceil(b_s / b_b)) + 2 n_h$ seeks, where $b_b$ blocks are
allocated for the input buffer and each output buffer.
+ Build Phase:
+ Choose the smaller table (to minimize memory usage) as the "build table."
+ Create an in-memory hash table. For each record in the build table,
compute a hash on the join key and insert the record into the hash table
using this hash value as an index.
+ Probe Phase:
+ Take each record from the larger table, which is often referred to as the
"probe table."
+ Compute the hash on the join key (same hash function used in the build
phase).
+ Use this hash value to look up in the hash table built from the smaller
table.
+ If the bucket (determined by the hash) contains any entries, check each
entry to see if the join key actually matches the join key of the record
from the probe table (since hash functions can lead to collisions).
+ Output the Joined Rows.
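The two phases above can be sketched as follows (a minimal in-memory sketch on lists of `(key, row)` pairs; names are illustrative):

```python
from collections import defaultdict

def hash_join(build, probe):
    """Join two (key, row) lists; `build` should be the smaller input."""
    table = defaultdict(list)
    for k, row in build:                 # build phase: hash the smaller input
        table[k].append(row)
    out = []
    for k, row in probe:                 # probe phase: look up each probe key
        for match in table.get(k, []):   # exact-key check handles collisions
            out.append((match, row))
    return out

pairs = hash_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y"), (2, "z")])
# pairs == [("b", "x"), ("b", "z")]
```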
= Relational-algebra
== Equivalence rules
- *Commutativity*: Union: $R union S = S union R$; Intersection: $R sect S = S sect R$;
Join: $R join S = S join R$; Selection: $sigma_(p_1)(sigma_(p_2)(R)) = sigma_(p_2)(sigma_(p_1)(R))$.
- *Associativity*: Union: $(R union S) union T = R union (S union T)$;
Intersection: $(R sect S) sect T = R sect (S sect T)$;
Join: $(R join S) join T = R join (S join T)$; Theta joins are associative in
the following manner: $(E_1 join_(theta_1) E_2) join_(theta_2 and theta_3)
E_3 ≡ E_1 join_(theta_1 and theta_3) (E_2 join_(theta_2) E_3)$, provided
$theta_2$ involves only attributes of $E_2$ and $E_3$.
- *Distributivity*: Union over Intersection:
$R union (S sect T) = (R union S) sect (R union T)$; Intersection over Union:
$R sect (S union T) = (R sect S) union (R sect T)$; Join over Union:
$R join (S union T) = (R join S) union (R join T)$; Selection over Union:
$sigma_p(R union S) = sigma_p(R) union sigma_p(S)$; Projection over Union:
$pi_c(R union S) = pi_c(R) union pi_c(S)$.
- Pushing Selections Through Joins: $sigma_p(R join S) = (sigma_p(R)) join S$
when $p$ involves only attributes of $R$.
- Pushing Projections Through Joins: $pi_c(R join S) = pi_c(pi_(c sect #[attr])
(R) join pi_(c sect #[attr]) (S))$.
== Operations
// FROM Database concepts
+ $σ_(θ_1 ∧ θ_2)(E) ≡ σ_(θ_1)(σ_(θ_2)(E))$
+ $σ_(θ_1)(σ_(θ_2)(E)) ≡ σ_(θ_2)(σ_(θ_1)(E))$
+ $Π_(L_1)(Π_(L_2)(… (Π_(L_n)(E)) …)) ≡ Π_(L_1)(E)$ -- only the outermost
projection matters.
+ Selections can be combined with Cartesian products and theta joins: $σ_θ(E_1
× E_2) ≡ E_1 ⋈_θ E_2$ -- this expression is just the definition of the theta
join; $σ_(θ_1)(E_1 ⋈_(θ_2) E_2) ≡ E_1 ⋈_(θ_1 ∧ θ_2) E_2$.
+ $E_1 ⋈_θ E_2 ≡ E_2 ⋈_θ E_1$
+ Join associativity: $(E_1 ⋈ E_2) ⋈ E_3 ≡ E_1 ⋈ (E_2 ⋈ E_3)$; $(E_1 ⋈_(θ_1)
E_2) ⋈_(θ_2 ∧ θ_3) E_3 ≡ E_1 ⋈_(θ_1 ∧ θ_3) (E_2 ⋈_(θ_2) E_3)$.
+ Selection distribution: $σ_(θ_1)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ E_2$;
$σ_(θ_1 ∧ θ_2)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ (σ_(θ_2)(E_2))$.
+ Projection distribution: $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ (Π_(L_1)(E_1)) ⋈_θ
(Π_(L_2)(E_2))$; $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ Π_(L_1 ∪ L_2)((Π_(L_1 ∪ L_3)(E_1))
⋈_θ (Π_(L_2 ∪ L_4)(E_2)))$.
+ Union and intersection commutativity: $E_1 ∪ E_2 ≡ E_2 ∪ E_1$;
$E_1 ∩ E_2 ≡ E_2 ∩ E_1$.
+ Set union and intersection are associative: $(E_1 ∪ E_2) ∪ E_3 ≡ E_1 ∪ (E_2 ∪ E_3)$;
$(E_1 ∩ E_2) ∩ E_3 ≡ E_1 ∩ (E_2 ∩ E_3)$.
+ The selection operation distributes over the union, intersection, and
set-difference operations: $σ_θ(E_1 ∪ E_2) ≡ σ_θ(E_1) ∪ σ_θ(E_2)$;
$σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ σ_θ(E_2)$; $σ_θ(E_1 - E_2) ≡ σ_θ(E_1) - σ_θ(E_2)$;
$σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ E_2$; $σ_θ(E_1 - E_2) ≡ σ_θ(E_1) - E_2$.
+ The projection operation distributes over the union operation:
$Π_L(E_1 ∪ E_2) ≡ (Π_L(E_1)) ∪ (Π_L(E_2))$
- Projection ($pi$). Syntax: $pi_(#[attributes])(R)$. Purpose: Reduces the
relation to only contain the specified attributes. Example:
$pi_(#[Name, Age])(#[Employees])$
- Selection ($sigma$). Syntax: $sigma_(#[condition])(R)$. Purpose: Filters rows
that meet the condition. Example: $sigma_(#[Age] > 30)(#[Employees])$
- Union ($union$). Syntax: $R union S$. Purpose: Combines tuples from both
relations, removing duplicates. Requirement: Relations must be
union-compatible.
- Intersection ($sect$). Syntax: $R sect S$. Purpose: Retrieves tuples common
to both relations. Requirement: Relations must be union-compatible.
- Difference ($-$). Syntax: $R - S$. Purpose: Retrieves tuples in R that are
not in S. Requirement: Relations must be union-compatible.
- Cartesian Product ($times$). Syntax: $R times S$. Purpose: Combines tuples
from R with every tuple from S.
- Natural Join ($join$). Syntax: $R join S$. Purpose: Combines tuples from R
and S based on common attribute values.
- Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: Combines tuples
from R and S where the theta condition holds.
- Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$.
Right Outer Join: $R join.r S$. Purpose: Extends join to include non-matching
tuples from one or both relations, filling with nulls.
= Concurrency
@@ -171,7 +306,9 @@ conflict serializable.
- *Read committed* allows only committed data to be read, but does not require repeatable reads.
- *Read uncommitted* allows uncommitted data to be read. Lowest isolation level allowed by SQL.
== Schedule
== Protocols
We say that a schedule S is *legal* under a given locking protocol if S is a possible
schedule for a set of transactions that follows the rules of the locking protocol. We say
@@ -179,27 +316,47 @@
that a locking protocol ensures conflict serializability if and only if all legal schedules
are *conflict serializable*; in other words, for all legal schedules the associated → relation
is acyclic.
=== Lock-based
==== Deadlock
==== Two-phase locking protocol
*The Two-Phase Locking (2PL)* Protocol is a concurrency control method used in
database systems to ensure serializability of transactions. The protocol
involves two distinct phases: *Locking Phase (Growing Phase):* A transaction
may acquire locks but cannot release any locks. During this phase, the
transaction continues to lock all the resources (data items) it needs to
execute. \ *Unlocking Phase (Shrinking Phase):* The transaction releases locks
and cannot acquire any new ones. Once a transaction starts releasing locks, it
moves into this phase until all locks are released.
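The two phases can be sketched as a simple guard (a hypothetical `TwoPhaseLocking` helper, not an actual DBMS API):

```python
class TwoPhaseLocking:
    """Sketch of the 2PL rule: once a transaction releases any lock,
    it may not acquire new ones."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after first unlock")
        self.held.add(item)          # growing phase

    def release(self, item):
        self.shrinking = True        # entering the shrinking phase
        self.held.discard(item)

t = TwoPhaseLocking()
t.acquire("A")
t.acquire("B")
t.release("A")
try:
    t.acquire("C")                   # illegal under 2PL
except RuntimeError as e:
    print(e)
```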
==== Problems of locks
*Deadlock* is a condition where two or more tasks are each waiting for the
other to release a resource, or more than two tasks are waiting for resources
in a circular chain.
==== Starvation
*Starvation* (also known as indefinite blocking) occurs when a process or
thread is perpetually denied necessary resources to process its work. Unlike
deadlock, where everything halts, starvation only affects some while others
progress.
=== Timestamp-based
*Timestamp Assignment:* Each transaction is given a unique timestamp when it
starts. This timestamp determines the transaction's temporal order relative to
others. *Read Rule:* A transaction can read an object if the last write
occurred by a transaction with an earlier or the same timestamp. *Write Rule:*
A transaction can write to an object if the last read and the last write
occurred by transactions with earlier or the same timestamps.
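A minimal sketch of these rules under standard timestamp ordering, assuming each data item tracks the largest read and write timestamps it has seen (names are illustrative):

```python
class TimestampedItem:
    """Timestamp-ordering checks for a single data item."""

    def __init__(self):
        self.read_ts = 0    # largest timestamp of any reader so far
        self.write_ts = 0   # largest timestamp of any writer so far

    def read(self, ts):
        if ts < self.write_ts:                   # a later txn already wrote
            raise RuntimeError("abort: read arrived too late")
        self.read_ts = max(self.read_ts, ts)

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:
            raise RuntimeError("abort: write arrived too late")
        self.write_ts = ts

x = TimestampedItem()
x.write(1)
x.read(2)
# x.write(1) would now abort: a transaction with timestamp 2 already read x
```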
=== Validation-based
Assumes that conflicts are rare and checks for them only at the end of a transaction.
*Working Phase:* Transactions execute without acquiring locks, recording all
data reads and writes. *Validation Phase:* Before committing, each transaction
must validate that no other transactions have modified the data it accessed.
*Commit Phase:* If the validation is successful, the transaction commits and
applies its changes. If not, it rolls back and may be restarted.
=== Version isolation
= Logs