mirror of https://github.com/kristoferssolo/Databases-II-Cheatsheet.git
synced 2025-10-21 18:20:35 +00:00
commit c0b7597c98 ("v2"), parent d7d61184ea
img/bitmap.png: new binary file (42 KiB, not shown)
main.typ: 303 lines
#set page(margin: (
  top: 0.6cm,
  bottom: 0.6cm,
  right: 0.6cm,
  left: 0.6cm,
))

#set text(6.2pt)
#show heading: it => {
  if it.level == 1 {
    // pagebreak(weak: true)
    text(8.5pt, upper(it))
  } else if it.level == 2 {
    text(8pt, smallcaps(it))
  } else {
    text(8pt, smallcaps(it))
  }
}
== Bitmap

Each bit in a bitmap corresponds to a possible item or condition, with a bit
set to 1 indicating presence (true) and a bit set to 0 indicating absence
(false).

#figure(
  image("img/bitmap.png", width: 30%)
)
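A minimal sketch of a bitmap index in Python, on hypothetical column data
(one bitmap per distinct value; equality is a lookup, AND/OR are bitwise
operations):

```python
# Hypothetical indexed column: one row value per position.
rows = ["red", "blue", "red", "green", "blue"]

# Build: bitmaps[v][i] == 1 iff row i has value v.
bitmaps = {}
for i, v in enumerate(rows):
    bitmaps.setdefault(v, [0] * len(rows))[i] = 1

# Equality query "color = red" is just a bitmap lookup:
print(bitmaps["red"])  # [1, 0, 1, 0, 0]

# Combining conditions is a bitwise operation on the bitmaps:
red_or_blue = [a | b for a, b in zip(bitmaps["red"], bitmaps["blue"])]
print(red_or_blue)  # [1, 1, 1, 0, 1]
```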
== B+ tree

*B+ tree* is a type of self-balancing tree data structure that maintains data
sorted and allows searches, sequential access, insertions, and deletions in
logarithmic time. It is an extension of the B-tree and is extensively used in
databases and filesystems for indexing. A B+ tree is *balanced*; its order
$n$ is defined such that each node (except the root) can have at most $n$
children (pointers) and at least $⌈n/2⌉$ children. *Internal nodes hold*
between $⌈n/2⌉−1$ and $n−1$ keys; leaf nodes also hold between $⌈n/2⌉−1$ and
$n−1$ keys but additionally store the data values corresponding to the keys.
*Leaf nodes linked*: leaf nodes are linked together, making range queries and
sequential access very efficient.

- *Insert (key, data)*:
  - Insert the key into the appropriate leaf node in sorted order;
  - If the node overflows (more than $n−1$ keys), split it, add the middle
    key to the parent, and adjust pointers;
    + Leaf split: keys $1$ to $ceil(frac(n,2))$ and $ceil(frac(n,2)) + 1$ to
      $n$ become two leaves. Promote (copy) the lowest key of the second
      leaf.
    + Internal node split: keys $1$ to $ceil(frac(n+1,2)) - 1$ stay, key
      $ceil(frac(n+1,2))$ is moved up, and the remaining keys form the new
      node.
  - If a split propagates to the root and causes the root to overflow, split
    the root and create a new root. Note: the root may contain fewer than
    $ceil(frac(n,2)) - 1$ keys.
- *Delete (key)*:
  - Remove the key from the leaf node.
  - If the node underflows (fewer than $⌈n/2⌉−1$ keys), keys and pointers
    are redistributed or nodes are merged to maintain minimum occupancy.
  - Adjustments may propagate up to ensure all properties are maintained.
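The linked leaves are what make range scans cheap: find the first qualifying
leaf, then walk right. A minimal sketch (not a full B+ tree; hypothetical
leaf contents):

```python
# Hypothetical leaves, already sorted and "linked" left-to-right.
leaves = [[2, 5], [8, 12], [15, 20]]

def range_query(lo, hi):
    """Collect keys in [lo, hi]: skip leaves entirely below lo, stop once
    a leaf reaches hi (stand-in for root-to-leaf descent + leaf links)."""
    out = []
    for leaf in leaves:
        if leaf[-1] < lo:      # whole leaf below the range: skip
            continue
        for k in leaf:
            if lo <= k <= hi:
                out.append(k)
        if leaf[-1] >= hi:     # range exhausted: stop following links
            break
    return out

print(range_query(5, 15))  # [5, 8, 12, 15]
```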
== Hash-index

*Hash indices* are a type of database index that uses a hash function to
compute the location (hash value) of data items for quick retrieval. They
are particularly efficient for equality searches that match exact values.

*Hash function*: a hash function takes a key (a data item's attribute used
for indexing) and converts it into a hash value. This hash value determines
the position in the hash table where the corresponding record's pointer is
stored.

*Hash table*: the hash table stores pointers to the actual data records in
the database. Each entry in the hash table corresponds to a potential hash
value generated by the hash function.
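A minimal sketch of the idea, assuming a toy hash function, a fixed bucket
count, and hypothetical records indexed on their first attribute:

```python
# Hypothetical records; row positions stand in for record pointers.
records = [("alice", 30), ("bob", 25), ("carol", 41)]
n_buckets = 8

def h(key):
    # Toy hash function; a real DBMS uses a better-distributed one.
    return hash(key) % n_buckets

# Build: each bucket holds "pointers" (positions) to matching records.
buckets = [[] for _ in range(n_buckets)]
for pos, (name, age) in enumerate(records):
    buckets[h(name)].append(pos)

def lookup(name):
    # Equality search: hash the key, scan only one bucket, and re-check
    # the key because different keys may collide in the same bucket.
    return [records[pos] for pos in buckets[h(name)] if records[pos][0] == name]

print(lookup("bob"))  # [('bob', 25)]
```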
= Algorithms

== Nested-loop join

=== Overview
*Nested Loop Join*: a database join operation where each tuple of the outer
table is compared against every tuple of the inner table to find all pairs
of tuples that satisfy the join condition. This method is simple but can be
inefficient for large datasets due to its high computational cost.

=== Cost
```python
# Simplified version (to get the idea)
for tr in r:
    for ts in s:
        test_pair(tr, ts)
```

// TODO: Add seek information
Block transfer cost (worst case): $n_r ∗ b_s + b_r$ block transfers, where
$n_r$ is the number of tuples in relation $r$ and $b_r$ is the number of
blocks in $r$ (similarly for $s$).
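The loop above, run on tiny in-memory relations (hypothetical `r(a, b)` and
`s(b, c)`, joined on `b`):

```python
# Hypothetical relations: r(a, b) and s(b, c).
r = [(1, "x"), (2, "y"), (3, "x")]
s = [("x", 10), ("y", 20)]

result = []
for tr in r:                 # outer relation
    for ts in s:             # inner relation: scanned once per outer tuple
        if tr[1] == ts[0]:   # join condition r.b = s.b
            result.append((tr[0], tr[1], ts[1]))

print(result)  # [(1, 'x', 10), (2, 'y', 20), (3, 'x', 10)]
```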
== Block-nested join

=== Overview
*Block Nested Loop Join*: an optimized version of the nested loop join that
reads and holds a block of rows from the outer table in memory and then
loops through the inner table, reducing the number of disk accesses and
improving performance over a standard nested loop join, especially when
indices are not available.

=== Cost

```python
# Simplified version (to get the idea)
for Br in blocks(r):
    for Bs in blocks(s):
        for tr in Br:
            for ts in Bs:
                test_pair(tr, ts)
```

// TODO: Add seek information
Block transfer cost (worst case): $b_r ∗ b_s + b_r$, where $b_r$ is the
number of blocks in relation $r$ (similarly for $s$).
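Plugging hypothetical sizes into the two cost formulas shows why holding a
whole block of the outer relation matters (worst case, one buffer block per
relation):

```python
# Hypothetical relation sizes.
n_r, b_r = 10_000, 400   # tuples / blocks of r (outer)
b_s = 100                # blocks of s (inner)

nested_loop = n_r * b_s + b_r    # inner relation scanned once per TUPLE of r
block_nested = b_r * b_s + b_r   # inner relation scanned once per BLOCK of r

print(nested_loop)   # 1000400
print(block_nested)  # 40400
```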
== Merge join

=== Overview
*Merge Join*: a database join operation where both the outer and inner
tables are first sorted on the join key and then merged by sequentially
scanning through both tables to find matching pairs. This method is highly
efficient when the tables are *already sorted* or can be *sorted quickly*,
and it minimizes random disk access.

+ Sort both tables: if not already sorted, the outer table and the inner
  table are sorted on the join keys.
+ Merge: once both tables are sorted, perform a merging operation similar to
  that used in merge sort:
  + Begin with the first record of each table.
  + Compare the join keys of the current records from both tables.
  + If the keys match, join the records and move to the next record in both
    tables.
  + If the join key of the outer table is smaller, move to the next record
    in the outer table.
  + If the join key of the inner table is smaller, move to the next record
    in the inner table.
  + Continue this process until all records in either table have been
    examined.
+ Output the joined rows.

=== Cost
The merge join is efficient: the number of block transfers is the sum of the
number of blocks in both files, $b_r + b_s$. Assuming $b_b$ buffer blocks
are allocated to each relation, $⌈b_r∕b_b⌉ + ⌈b_s∕b_b⌉$ disk seeks are
required.
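The merge step, sketched on already-sorted in-memory relations (hypothetical
data; for simplicity, join keys are assumed unique within each relation, so
both cursors advance on a match):

```python
# Hypothetical relations, already sorted on the join key (first attribute).
r = [(1, "a"), (3, "b"), (5, "c")]
s = [(3, "x"), (4, "y"), (5, "z")]

i = j = 0
result = []
while i < len(r) and j < len(s):
    if r[i][0] == s[j][0]:       # keys match: output, advance both
        result.append((r[i][0], r[i][1], s[j][1]))
        i += 1
        j += 1
    elif r[i][0] < s[j][0]:      # outer key smaller: advance outer
        i += 1
    else:                        # inner key smaller: advance inner
        j += 1

print(result)  # [(3, 'b', 'x'), (5, 'c', 'z')]
```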
== Hash-join

=== Overview
*Hash Join*: a database join operation that builds an in-memory hash table
on the join key of the smaller table (often called the build table) and then
probes this hash table with the join key of the larger table (the probe
table) to find matching pairs. This technique is very efficient for *large
datasets* where *indexes are not present*, as it reduces the need for nested
loops.

+ Build phase:
  + Choose the smaller table (to minimize memory usage) as the "build
    table".
  + Create an in-memory hash table. For each record in the build table,
    compute a hash on the join key and insert the record into the hash table
    using this hash value as an index.
+ Probe phase:
  + Take each record from the larger table, often referred to as the "probe
    table".
  + Compute the hash on its join key (with the same hash function used in
    the build phase).
  + Use this hash value to look up in the hash table built from the smaller
    table.
  + If the bucket (determined by the hash) contains any entries, check each
    entry to see whether its join key actually matches the join key of the
    probe record (since hash functions can lead to collisions).
  + Output the joined rows.

=== Cost
- $h$ is a hash function mapping JoinAttrs values to ${0, 1, …, n_h}$, where
  JoinAttrs denotes the common attributes of $r$ and $s$ used in the natural
  join.
- $r_0, r_1, …, r_(n_h)$ denote partitions of $r$ tuples, each initially
  empty. Each tuple $t_r in r$ is put in partition $r_i$, where
  $i = h(t_r [#[JoinAttrs]])$.
- $s_0, s_1, …, s_(n_h)$ denote partitions of $s$ tuples, each initially
  empty. Each tuple $t_s in s$ is put in partition $s_i$, where
  $i = h(t_s [#[JoinAttrs]])$.

Cost of block transfers: $3(b_r + b_s) + 4 n_h$. The hash join requires
$2(⌈b_r∕b_b⌉ + ⌈b_s∕b_b⌉) + 2 n_h$ seeks, where $b_b$ blocks are allocated
for the input buffer and each output buffer.
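A sketch of the in-memory build/probe cycle on hypothetical relations, with
Python's built-in `hash` standing in for the hash function:

```python
from collections import defaultdict

# Hypothetical relations joined on r.b = s.b; s is smaller: build side.
r = [(1, "x"), (2, "y"), (3, "x")]   # probe side: (a, b)
s = [("x", 10), ("y", 20)]           # build side: (b, c)

# Build phase: bucket build-side tuples by hash of the join key.
buckets = defaultdict(list)
for ts in s:
    buckets[hash(ts[0])].append(ts)

# Probe phase: hash each probe tuple's key, scan only the matching bucket,
# and re-check the key to guard against hash collisions.
result = []
for tr in r:
    for ts in buckets.get(hash(tr[1]), []):
        if ts[0] == tr[1]:
            result.append((tr[0], tr[1], ts[1]))

print(result)  # [(1, 'x', 10), (2, 'y', 20), (3, 'x', 10)]
```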
= Relational-algebra

== Equivalence rules

- *Commutativity*: Union: $R∪S=S∪R$; Intersection: $R∩S=S∩R$; Join:
  $R join S=S join R$; Selection:
  $sigma_(p_1)(sigma_(p_2)(R))=sigma_(p_2)(sigma_(p_1)(R))$.
- *Associativity*: Union: $(R∪S)∪T=R∪(S∪T)$; Intersection:
  $(R∩S)∩T=R∩(S∩T)$; Join: $(R join S) join T=R join (S join T)$; Theta
  joins are associative in the following manner: $(E_1 join_(theta_1) E_2)
  join_(theta_2 and theta_3) E_3 ≡ E_1 join_(theta_1 and theta_3) (E_2
  join_(theta_2) E_3)$, where $theta_2$ involves attributes of $E_2$ and
  $E_3$ only.
- *Distributivity*: Union over Intersection: $R∪(S∩T)=(R∪S)∩(R∪T)$;
  Intersection over Union: $R∩(S∪T)=(R∩S)∪(R∩T)$; Join over Union:
  $R join (S∪T)=(R join S)∪(R join T)$; Selection over Union:
  $sigma_p(R∪S)=sigma_p(R)∪sigma_p(S)$; Projection over Union:
  $pi_c(R∪S)=pi_c(R)∪pi_c(S)$.
- Pushing selections through joins (selection and join commutativity):
  $sigma_p(R join S)=(sigma_p(R)) join S$ when $p$ involves only attributes
  of $R$.
- Pushing projections through joins: $pi_c(R join S)=pi_c(pi_(c sect
  #[attr])(R) join pi_(c sect #[attr])(S))$
== Operations
// FROM Database concepts

+ $σ_(θ_1 ∧ θ_2)(E) ≡ σ_(θ_1)(σ_(θ_2)(E))$
+ $σ_(θ_1)(σ_(θ_2)(E)) ≡ σ_(θ_2)(σ_(θ_1)(E))$
+ $Π_(L_1)(Π_(L_2)(… (Π_(L_n)(E)) …)) ≡ Π_(L_1)(E)$ -- only the outermost
  projection matters.
+ Selections can be combined with Cartesian products and theta joins:
  $σ_θ(E_1 × E_2) ≡ E_1 ⋈_θ E_2$ (this is just the definition of the theta
  join); $σ_(θ_1)(E_1 ⋈_(θ_2) E_2) ≡ E_1 ⋈_(θ_1 ∧ θ_2) E_2$
+ $E_1 ⋈_θ E_2 ≡ E_2 ⋈_θ E_1$
+ Join associativity: $(E_1 ⋈ E_2) ⋈ E_3 ≡ E_1 ⋈ (E_2 ⋈ E_3)$;
  $(E_1 join_(theta_1) E_2) join_(theta_2 and theta_3) E_3 ≡ E_1
  join_(theta_1 and theta_3) (E_2 join_(theta_2) E_3)$
+ Selection distribution: $σ_(θ_1)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ E_2$;
  $σ_(θ_1 ∧ θ_2)(E_1 ⋈_θ E_2) ≡ (σ_(θ_1)(E_1)) ⋈_θ (σ_(θ_2)(E_2))$
+ Projection distribution: $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ (Π_(L_1)(E_1)) ⋈_θ
  (Π_(L_2)(E_2))$; $Π_(L_1 ∪ L_2)(E_1 ⋈_θ E_2) ≡ Π_(L_1 ∪ L_2)((Π_(L_1 ∪
  L_3)(E_1)) ⋈_θ (Π_(L_2 ∪ L_4)(E_2)))$
+ Union and intersection commutativity: $E_1 ∪ E_2 ≡ E_2 ∪ E_1$;
  $E_1 ∩ E_2 ≡ E_2 ∩ E_1$
+ Set union and intersection are associative: $(E_1 ∪ E_2) ∪ E_3 ≡ E_1 ∪
  (E_2 ∪ E_3)$; $(E_1 ∩ E_2) ∩ E_3 ≡ E_1 ∩ (E_2 ∩ E_3)$
+ The selection operation distributes over the union, intersection, and
  set-difference operations: $σ_θ(E_1 ∪ E_2) ≡ σ_θ(E_1) ∪ σ_θ(E_2)$;
  $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ σ_θ(E_2)$; $σ_θ(E_1 − E_2) ≡ σ_θ(E_1) −
  σ_θ(E_2)$; also $σ_θ(E_1 ∩ E_2) ≡ σ_θ(E_1) ∩ E_2$ and $σ_θ(E_1 − E_2) ≡
  σ_θ(E_1) − E_2$
+ The projection operation distributes over the union operation:
  $Π_L(E_1 ∪ E_2) ≡ (Π_L(E_1)) ∪ (Π_L(E_2))$

- Projection ($pi$). Syntax: $pi_(#[attributes])(R)$. Purpose: reduces the
  relation to only the specified attributes. Example:
  $pi_(#[Name, Age])(#[Employees])$
- Selection ($sigma$). Syntax: $sigma_(#[condition])(R)$. Purpose: filters
  rows that meet the condition. Example: $sigma_(#[Age] > 30)(#[Employees])$
- Union ($union$). Syntax: $R union S$. Purpose: combines tuples from both
  relations, removing duplicates. Requirement: relations must be
  union-compatible.
- Intersection ($sect$). Syntax: $R sect S$. Purpose: retrieves tuples
  common to both relations. Requirement: relations must be union-compatible.
- Difference ($-$). Syntax: $R - S$. Purpose: retrieves tuples in $R$ that
  are not in $S$. Requirement: relations must be union-compatible.
- Cartesian Product ($times$). Syntax: $R times S$. Purpose: combines every
  tuple from $R$ with every tuple from $S$.
- Natural Join ($join$). Syntax: $R join S$. Purpose: combines tuples from
  $R$ and $S$ based on common attribute values.
- Theta Join ($join_theta$). Syntax: $R join_theta S$. Purpose: combines
  tuples from $R$ and $S$ where the theta condition holds.
- Full Outer Join: $R join.l.r S$. Left Outer Join: $R join.l S$. Right
  Outer Join: $R join.r S$. Purpose: extends join to include non-matching
  tuples from one or both relations, filling with nulls.
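A quick sanity check of the push-selection rule on hypothetical relations
(a minimal evaluator, not a real algebra engine; the shared attribute `b`
and the predicate are assumptions of the sketch):

```python
# Hypothetical relations R(a, b) and S(b, c) as lists of dicts.
R = [{"a": 1, "b": "x"}, {"a": 5, "b": "y"}]
S = [{"b": "x", "c": 10}, {"b": "y", "c": 20}]

def natural_join(r, s):
    # Join on the shared attribute "b" (assumption for this sketch).
    return [{**tr, **ts} for tr in r for ts in s if tr["b"] == ts["b"]]

p = lambda t: t["a"] > 2   # predicate using only R's attributes

# σ_p(R ⋈ S)  vs  (σ_p(R)) ⋈ S
lhs = [t for t in natural_join(R, S) if p(t)]
rhs = natural_join([t for t in R if p(t)], S)
print(lhs == rhs)  # True
```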
= Concurrency

- *Read committed* allows only committed data to be read, but does not
  require repeatable reads.
- *Read uncommitted* allows uncommitted data to be read. It is the lowest
  isolation level allowed by SQL.

== Protocols

We say that a schedule S is *legal* under a given locking protocol if S is a
possible schedule for a set of transactions that follows the rules of the
locking protocol. We say that a locking protocol ensures conflict
serializability if and only if all legal schedules are *conflict
serializable*; in other words, for all legal schedules the associated
→ relation is acyclic.
=== Lock-based

==== Two-phase locking protocol

*The Two-Phase Locking (2PL) protocol* is a concurrency control method used
in database systems to ensure serializability of transactions. The protocol
involves two distinct phases. *Locking phase (growing phase):* a transaction
may acquire locks but cannot release any. During this phase, the transaction
continues to lock all the resources (data items) it needs to execute. \
*Unlocking phase (shrinking phase):* the transaction releases locks and
cannot acquire any new ones. Once a transaction starts releasing locks, it
stays in this phase until all locks are released.

==== Problems of locks

*Deadlock* is a condition where two or more tasks are each waiting for the
other to release a resource, or more than two tasks are waiting for
resources in a circular chain. \ *Starvation* (also known as indefinite
blocking) occurs when a process or thread is perpetually denied the
resources it needs to do its work. Unlike deadlock, where everything halts,
starvation affects only some tasks while others progress.
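The two-phase discipline can be enforced mechanically. A minimal sketch
(hypothetical class; no lock manager, blocking, or conflict handling):

```python
class TwoPhaseTxn:
    """Tracks one transaction's locks: all acquires before any release."""

    def __init__(self):
        self.held = set()
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock in shrinking phase")
        self.held.add(item)      # growing phase: acquire only

    def unlock(self, item):
        self.shrinking = True    # first release ends the growing phase
        self.held.discard(item)

t = TwoPhaseTxn()
t.lock("A")
t.lock("B")
t.unlock("A")                    # shrinking phase begins
try:
    t.lock("C")                  # violates 2PL
except RuntimeError as e:
    print(e)                     # 2PL violation: cannot lock in shrinking phase
```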
=== Timestamp-based

*Timestamp assignment:* each transaction is given a unique timestamp when it
starts. This timestamp determines the transaction's temporal order relative
to others. *Read rule:* a transaction can read an object if the last write
was performed by a transaction with an earlier or the same timestamp.
*Write rule:* a transaction can write to an object if the last read and the
last write were performed by transactions with earlier or the same
timestamps.
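The read/write rules for a single data item, as a minimal sketch
(hypothetical helper; a real scheduler also restarts rolled-back
transactions with new timestamps):

```python
class Item:
    """One data item under timestamp ordering."""

    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read the item
        self.write_ts = 0   # timestamp of the last write

    def read(self, ts):
        if ts < self.write_ts:      # read rule violated: item written by
            return "rollback"       # a younger transaction
        self.read_ts = max(self.read_ts, ts)
        return "ok"

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:  # write rule violated
            return "rollback"
        self.write_ts = ts
        return "ok"

x = Item()
print(x.write(5))   # ok
print(x.read(3))    # rollback
```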
=== Validation-based

Assumes that conflicts are rare and checks for them only at the end of a
transaction. *Working phase:* transactions execute without acquiring locks,
recording all data reads and writes. *Validation phase:* before committing,
each transaction must validate that no other transactions have modified the
data it accessed. *Commit phase:* if the validation is successful, the
transaction commits and applies its changes; if not, it rolls back and may
be restarted.

=== Version isolation

= Logs