|
return to content page/computer science home
Topics:
- Relational Theory
- Oracle Database Administration
- Temporal Databases
- Non-relational Data Models
- Distributed Databases
- Data Mining
- Data Warehousing
- Multi-dimensional Indexing
- The ANSI/SPARC Architecture
- Relations
- relation scheme R = {A1,...,An}
(list of attribute names)
- for each Ai, there is a domain of Ai
= Di = dom(Ai) (set of possible
values for that attribute)
- a relation r on a relation scheme R:
r = {t1,...,tp} (a tuple):
t(Ai) Di
- Functional Dependencies
- X → Y: attribute Y of R is functionally dependent on
attribute X of R iff for each X-value, there is only one
legal Y-value
- e.g. {Person, time, date} → Place
- Heath's Theorem: Lossless decomposition
R[A,B,C] and A → B
then can get a lossless decomposition of R into
R1[A,B] and R2[A,C]
|
(An example of a lossy decomposition would be
R1[C,A] and R2[C,B])
- Testing for a Lossless/Non-additive Join
Given R → R1,R2,...,Rn
want to check that R1 R2
... Rn
Algorithm
- Create matrix: Ri × Aj, and
initialize each cell to bij
e.g. R1={A,B}, R2={A,C} and F={A→B}
| j=1 | j=2 | j=3 |
| A | B | C |
| i=1 | R1 |
b11 |
b12 |
b13 |
| i=2 | R2 |
b21 |
b22 |
b23 |
- For each row, Ri:
for each attribute
Aj Ri, set
cell to aij
e.g.
R1={A,B} → |
| A | B | C |
| R1 |
a11 |
a12 |
b13 |
| R2 |
b21 |
b22 |
b23 |
|
R2={A,C} → |
| A | B | C |
| R1 |
a11 |
a12 |
b13 |
| R2 |
a21 |
b22 |
a23 |
|
- Repeat till no change:
- for each FD, X → Y
F:
- for all rows which have the same values (a's or b's),
make their Y columns the same:
- if at least one is an a, set all to aj
- else choose an i and set all to bij
e.g. F={A→B}
| A | B | C |
| R1 |
a11 |
a12 |
b13 |
| R2 |
a21 |
b22 |
a23 |
|
A→B → |
| A | B | C |
| R1 |
a11 |
a12 |
b13 |
| R2 |
a21 |
a12 |
a23 |
|
- If there exists a row that has all a's in it then the join
is lossless (non-additive)
- Inference Axioms
- Maier's List
- Reflexivity
- Augmentation
- Additivity/Union
- Projectivity/Decomposition
- Transitivity
- Pseudotransitivity
|
- X→X
- X→Y implies XZ→Y (or XZ →YZ)
- X→Y and X→Z implies X→YZ
- X→YZ implies X→Y
- X→Y and Y→Z implies X→Z
- X→Y and YZ→W implies XZ→W
|
- Armstrong's Axioms
1, 2 and 6 above: Reflexivity, Augmentation and
Pseudotransitivity
sound and complete
- Closure of a Set of Dependencies, F+
- F+ = closure on a set of dependencies, F
= all the FDs implied by F
- includes trivial dependencies, e.g. A→A etc.
- if F and G are FDs on a relation R and F+ =
G+
then F≡G and F is cover for G and G is cover for F
- Maximum number of FDs that a relation R can satisfy is
(2n-1)2 since for a FD such as X→Y,
both X and Y can be any combination of n attributes
- Minimal Cover
G is minimal cover (an irreducible set) of F if G is a
covering set of the FDs in F and:
- every RHS of the FDs in G is a single attribute
- every LHS is irreducible (no redundencies)
- no FD in G is redundent
- Closure of a Set of Attributes, A+
- Calculation of closure of a set of attributes, A
Algorithm
- oldA = {}
- newA = A
- while oldA != newA
- oldA = newA
- for each FD, X→Y, in F
- if X
newA
- newA = newA
Y
- return newA
- Does a set of FDs, F imply X→Y?
Compute X+ (the closure of X on F)
if Y X+ then true
else false
- Keys
- superkey
X is a superkey of R if:
X →
A1A2...An F+
- candidate key
X is a candidate key of R if:
- X is a superkey of R and
- for no proper subset Y
X
is Y → A1A2...An F+ (i.e. cannot discard a
single A in X to leave a superkey)
To determine if a set of attributes, A, can be a candidate
key of a relation, R, given a set of FDs, F:
- calculate A+ using F
- if A+ includes all the attributes in R then
yes, A is a candidate key
- Preservation of Dependencies
F is a set of FDs for a relational scheme R
F+ is the closure of F
| Π(Z)(F) |
= projection of F onto a set of attributes Z |
| = dependencies X → Y in F+:
XY Z |
Therefore, to test preservation of dependencies when splitting
R into R1, R2, etc.
- calculate F+
- calculate F1 =
Π(R1)(F),
F2 = Π(R2)(F), etc.
- Is F+ = F1
F2...?
- Normal Forms
- N1NF (Non First Normal Form)
e.g. {name,email}={("Alice","alice@orcon, ahol@cs")}
- 1NF (First Normal Form)
A relation is in 1NF if the domains of all attributes
contain only atomic values and
the value of any attribute in a tuple is a single value
from the domain.
e.g.1.
{name,email}={("Alice","alice@orcon"),("Alice","ahol@cs")}
e.g.2. R[ABCDE], F={A→C,B→E,AB→D}, CK=AB
- 2NF (Second Normal Form)
A relation is in 2NF if it is in 1NF and every non-prime
attribute is fully functionally dependent on the primary
key (i.e. no partial dependencies:
).
e.g. R[ABCD], F={AB→C,AB→D,C→D},CK=AB
- 3NF (Third Normal Form)
A relation is in 3NF if it is in 2NF and no non-prime
attribute is transitively dependent on any key, or in
other words:
A relation R is in 3NF if, for every non-trivial FD,
X→A either
- X is a superkey of R, or
- A is a prime attribute of R
e.g. R[SCP], F={SC→P,P→C}, CK=SC
- BCNF (Boyce-Codd Normal Form)
A relation is in BCNF if it is in 3NF and every
determinant (X→Y, X is the determinant) is a
candidate key, or in other words:
A relation R is in BCNF if it is in 2NF and, for every non-trivial FD,
X→A
e.g. R[ABCD],
F={A→B,B→A,A→C,A→D,B→D}, CK=A|B
- 4NF (Fourth Normal Form)
Multi-valued dependencies (MVD) have been removed.
A→→B (A multidetermines B) iff whenever there
are two tuples, say t1 and t2, where
the values for A are the same, but B and C may be
different, then there will exist two other tuples,
t3 and t4, that give the two different
combinations of B and C values:
| A | B | C |
| t1 |
a1 |
b1 |
c1 |
| t2 |
a1 |
b2 |
c2 |
| t3 |
a1 |
b1 |
c2 |
| t4 |
a1 |
b2 |
c1 |
|---|
And we write A→→B or equivalently
A→→C, or A→→B|C
- 5NF or PJNF (Fifth Normal Form or Project Join Normal
Form)
Every join dependency in R is a consequence of the
candidate keys of R.
- Normalisation
- Point of Normalisation
Remove certain kinds of redundencies, avoid certain update
anomalies, to produce an intuitive representation of the
real world and to simplify the enforcement of integrity constraints.
- Decomposition into 3NF
- begin with a relation scheme R and F, a minimal
cover set of dependencies
- remove attributes in R not involved in F to a separate
scheme
- for each dependency, X→A in F, create a relation
scheme [XA] (note, if have
X→A1,X→A2 etc., combine in
a single scheme: [XA1A2...])
- Decomposition into BCNF
For each functional dependency, X→A where X is not a
superkey of R, separate into the relations
R1[XA] and R2[R-A].
R2[R-A] may not be in BCNF so may have to
repeat.
- Decomposition into 4NF
Remove multi-valued dependencies: if R[ABC] and
A→→B|C then split up into R1[AB]
and R2[AC]
return to top
- An Oracle Server
- An Oracle Database
Data files, redo log files, security control files, a
parameter/init file, trace files, an alert log.
- An Oracle Instance
The dynamic part: Processes (both SYSTEM and user).
Consists of a System Global Area (SGA - includes a DB
buffer cache and a Redo log buffer), server processes
produced to handle user requests, and a set of
background processes (includes a database writer, a log
writer, checkpoint, system and process monitors, an
archiver, recoverer processes, dispatchers and lock processes).
- Role of the DBA
A data administrator makes the strategic and policy decisions
about the data, the DBA implements these.
- Defines both the conceptual schema (tables, views,
indices, synonyms etc.) and the
internal schema (Tablespaces - Oracle will automatically
create a SYSTEM tablespace, at least one user tablespace
is also required), and the mapping between the two.
- Liase with users.
- Defines the security and integrity constraints.
- Defining dump and reload policies: recovery.
- Monitoring performance and responding to changing
requirements (e.g. new user, add column to a table, etc.)
Note: Oracle produces a few tables automatically: all_tables,
dba_users...
- Physical Storage in Oracle
Data block (logical block, page, Oracle block) - fixed size
Extent - specific number of contiguous blocks
Segment - a set of extents used for a certain logical
structure, e.g. data segment, index segment, rollback segment...
- Using the Server Manager
- Commands available
As DBA, use the server manager (
svrmgrl to run
the command line version) to administer the database.
CONNECT|DISCONNECT |
First thing to do:connect internal,
can also create a dummy user and connect as them to check how
things are going, or connect as sysdba or SYSTEM |
STARTUP|SHUTDOWN |
the oracle instance - need to keep this running
for your users, to check:
ps auwx | grep -i ora |
ARCHIVE LOG |
start | stop | list | next | all |
'destination' |
RECOVER |
database [manual] | tablespace
tsname[,tsname2] |
SET|SHOW |
instance | echo | termout | timing | numwidth |
charwidth | longwidth | datewidth | autoprint, in
addition SHOW has all | spool |
EXIT | from svrmgrl |
and SQL statements (note: to run a set of SQL statement
from file type: @filename.sql).
- Users
- Creating a new user
create user username identified
by password default tablespace tbsname
temporary tablespace tbstempname;
- Changing a user's password
alter user username identified
by newpassword;
- System Privileges
- Granting
grant [connect][resource] to username;
Connect just allows a user to login, resource allows
them to build their own tables etc. Usually a user
will need both.
- Revoking
revoke [connect][resource] from username;
- Checking
select * from dba_role_privs [where grantee =
'username'];
- Object Privileges
- Granting
grant [select][insert][delete][all] on
owner.tablename to username|rolename|PUBLIC [with
grant option];
- Revoking
revoke [select][insert][delete][all] on
owner.tablename from username|rolename|PUBLIC;
- Checking
select * from dba_tab_privs where grantee =
'username';
- Synonyms and Object Ownership
- Checking Object Ownership
select owner, object_name, object_type from
dba_objects where [grantee = 'username'] | [object_name =
'tablename'];
Determines either what a particular user owns, or who
owns a particular table.
- Creating
To avoid users having to know the owner of a table they
have object privileges on, we can create synonyms:
create public synonym
syntablename for owner.tablename;
- Dropping
drop public synonym syntablename;
- Roles
- Creating
To simplify the granting/revoking of privileges we can
create roles. Then users are granted roles and roles are
granted priveleges (just like users are).
create role
rolename;
- Applying to users
grant rolename to username;
- Altering or dropping a role
Altering a role can be quite messy. It may be easier
to simply drop a role and then create it again.
drop role rolename;
- Backing Up and Recovery
Regular backups shoule be made. This can be done by exporting
the whole database on a regular basis to a flat .dmp file (using
the
exp command). Recovery can be accomplished
by importing (using the imp command) the backed
up file. In linux, can use crontab to automatically run a
bash script which does the backups.
return to top
- Temporal Concepts
Most DBs simply overwrite old data when updating. Need a TBDMS
(temporal database management system) and associated TSQL.
- Time, Granularity and the Chronon
Time may be discrete or continuous, but in a database
application it will be discrete. The granularity of time
(e.g. nearest ms|s|min|h|d|y etc) is specific to an
application. A chronon is the smallest unit
of time used. An interval is a set of successive
chronons. An event that occurs instantaneously will exist
for one chronon.
- Time Dimensions
- Valid time
also real-world time, intrinsic time, logical time,
data time.
The real-world time during which an object exists or
an event takes place.
- Transaction time
also registration time, extrinsic time, physical time.
The DBMS time at which an event was entered into the
temporal database.
- User-defined time
A domain supported by the data model, as defined by
the user, e.g. dob.
- Allen's Temporal Operators and TSQL2
Used to describe temporal relationships between two time
intervals, e.g. X and Y:
| Allen's operator | | TSQL2 operator |
| X EQUALS Y |
XXX
YYY
| = |
| X BEFORE Y |
XXX
YYY
| X PRECEDES Y |
| X MEETS Y |
XXX
YYY
| END(X) = BEGIN(Y) |
| X OVERLAPS Y |
XXX
YYY
| X OVERLAPS Y AND
END(X) PRECEDES END(Y) |
| X DURING Y |
XXX
YYYYY
| BEGIN(X) PRECEDES BEGIN(Y) AND END(X)
PRECEDES END(Y) |
| X STARTS Y |
XXX
YYYYYY
| BEGIN(X) = BEGIN(Y) AND END(X)
PRECEDES END(Y) |
| X FINISHES Y |
XXX
YYYYYY
| BEGIN(X) PRECEDES BEGIN(Y) AND END(X)
= END(Y) |
In TSQL2, to refer to the valid time interval of a table
use:
VALID(tablename)
Thus, to refer to the specific v_start and v_end columns of a
table use:
BEGIN(VALID(tablename))
END(VALID(tablename))
Then can use the PRECEDES, =, OVERLAPS or CONTAINS operators
to compare to another time interval (or
PERIOD(starttime, endtime).
- Valid Time (Historical) Databases
e.g. a dblog of when user sessions, so
v_start is their login
time and v_end is their logout time (before they
logout this is set to the current time symbol: ).
e.g. of a query: Who was logged in, and to which computer, at
11.05am?
SELECT d.user, d.computer
FROM dblog d
WHERE VALID(d) CONTAINS '11.05';
- Bitemporal Databases
Includes both valid time (v_start and
v_end) and transactional
time (t_start and t_end).
Nothing is ever deleted (only current times can get pinned down).
I think of these as belief revision databases where the
transactional period is the period over which a belief
is/was held. Thus it is obvious that two conflicting beliefs
cannot be held concurrently, so if need to change belief
at chronon c, then old belief must end at c-1 for new
belief to begin at c.
e.g.
- Monday: Three suppliers (S1, S2 and S3) contracted on Monday and loaded
into the database.
| snum | sname | city |
v_start | v_end | t_start | t_end |
| S1 | Smith | London |
Mon |  |
Mon |  |
| S2 | Jones | Paris |
Mon |  |
Mon |  |
| S3 | Blake | Paris |
Mon |  |
Mon |  |
Consider the
in the t_end column as denoting a current belief.
- Tuesday: Supplier S4 starts as a contractor, and DB is
updated. Supplier S1's contract is cancelled as from
today, and DB updated immediately.
| snum | sname | city |
v_start | v_end | t_start | t_end |
| S1 | Smith | London |
Mon | Mon |
Mon |  |
| S2 | Jones | Paris |
Mon |  |
Mon |  |
| S3 | Blake | Paris |
Mon |  |
Mon |  |
| S4 | Clark | London |
Tues |  |
Tues |  |
- Wednesday: Supplier S3 moves to New York and change noted
immediately. Supplier S57 starts.
| snum | sname | city |
v_start | v_end | t_start | t_end |
| S1 | Smith | London |
Mon | Mon |
Mon |  |
| S2 | Jones | Paris |
Mon |  |
Mon |  |
| S3 | Blake | Paris |
Mon | Tues |
Mon |  |
| S4 | Clark | London |
Tues |  |
Tues |  |
| S3 | Blake | New York |
Wed |  |
Wed |  |
| S57 | Heinz | Berlin |
Wed |  |
Wed |  |
Note that the belief
that S3 was in Paris between Mon and Tues inclusive has
not changed so we donot need to change the t_end.
- Thursday: The auditor comes: S2 didn't actually start
till Tues; S57 doesn't, and never did, exist; S1 is still in
Paris and always has been.
| snum | sname | city |
v_start | v_end | t_start | t_end |
| S1 | Smith | London |
Mon | Mon |
Mon | Wed |
| S2 | Jones | Paris |
Mon |  |
Mon | Wed |
| S3 | Blake | Paris |
Mon | Tues |
Mon |  |
| S4 | Clark | London |
Tues |  |
Tues |  |
| S3 | Blake | New York |
Wed |  |
Wed |  |
| S57 | Heinz | Berlin |
Wed |  |
Wed | Wed |
| S2 | Jones | Paris |
Tues |  |
Thurs |  |
| S1 | Smith | Paris |
Mon | Mon |
Thurs |  |
- Friday: S1 is now trading as Smith International of Rome
and is given a new contract.
| snum | sname | city |
v_start | v_end | t_start | t_end |
| S1 | Smith | London |
Mon | Mon |
Mon | Wed |
| S2 | Jones | Paris |
Mon |  |
Mon | Wed |
| S3 | Blake | Paris |
Mon | Tues |
Mon |  |
| S4 | Clark | London |
Tues |  |
Tues |  |
| S3 | Blake | New York |
Wed |  |
Wed |  |
| S57 | Heinz | Berlin |
Wed |  |
Wed | Wed |
| S2 | Jones | Paris |
Tues |  |
Thurs |  |
| S1 | Smith | Paris |
Mon | Mon |
Thurs |  |
| S1 | Smith Intl | Rome |
Fri |  |
Fri |  |
return to top
- The Network Model
- The CODASYL Model

DDL: Data Description Language, language
independent in the Schema level but language compatible in
the Sub-schema level
DML: Data Markup Language, always language
compatible
Advantage: each language could define its own DDL and DML.
But, language compatible DDL and DML only defined for COBAL.
- Bachman diagrams
Data model objects:
- record - group of related data values, a tuple
- set - describes a 1:N relationship between two
record types; has a name (type of relationship,
e.g. major_dept), an owner record (e.g a department) and a linked
list of member records (e.g. a list of students)
A set type consists of a name, an owner record type and a
member record type.
e.g.

- Sample Data
So, each record can have links to a parent/owner of a set (if it is the
last member record in a set), to the first child in a list
of member records of a set, or to the next member record
in a list of a set. Also, each record could be involved
in different sets - as members or owners.
- Data retrieval: Navigation
Uses pointers to current set and record in that set. Like any linked
list datatype, such a structure is fast for certain kinds
of retrieval processes, e.g. iteration.
- The Hierarchical Model
- Architecture
IBM's Information Maganagement System (IMS):
Based on business models. Records and parent-child
relationships (1:many, similar
to sets in the network model except that there are no
pointers back to the parent).
Very efficient for many applications, but
difficult to extend or alter once built (i.e. lacking in
relational flexibility to add/remove tables).
- Hierarchical Schemas
e.g.

- Sample Data
- The Object Model
The next data model!
The Object Oriented Database System Manifesto lists 13
mandatory features:
- OO Mandatory Features
- Complex objects
Made up from the simplest objects: integers,
characters, bytes, strings, booleans and floats.
Also need:
- tuples - represent properties of an entity
- sets - natural way of representing collections
- lists - or arrays, to represent order
- Object identity
An object's existence is independent of its value.
Thus identical objects are the same objects, but can also
have equal objects which are objects with the
same value. Differnet objects can also share a common
component so that one object can update the value of
this shared component.
- Encapsulation
Provides logical data independence: we can change
how a type is implemented without needing to change
the programs that access it.
- Types and classes
- type - compile time checking; summarizes common
features; abstract data type; interface and an implementation
- class - run time manipulating; object factory creates
new instances, object warehouse stores those
instances; can operate on entire warehouse at once
- Class or type hierarchies
Utilize inheritance, which reduces coding and helps
reusibility
e.g.
- Overriding, overloading and late binding
- Overriding e.g. in Picture, Person and Graph
object, each have a method called display()
- Overloading e.g. display(Picture), display(Person)
and display(Graph)
- Or we can just have display(Object) and let the
method itself work out the type of the Object and
display it accordingly.
- Computational completeness
Ability to express any computable funciton in the DML.
- Extensibility
Ability to define new types, and no distinction in usage
between system and user-defined types.
- DB Mandatory Features
- Persistence
Should be implicit - i.e. user should not have to
explicitly save the data to make it persist.
- Secondary storage management
Hidden (from user) performance features: index
management, data clustering and buffering, access path
selection, query optimization. DBA may have some
control in programming these.
- Concurrency
Multiple users and queries, so must still support
atomicity of a sequence of operations and controlled
sharing using locks etc.
- Recovery
Recovery from processor or disk failures.
- Ad hoc query facility
Provide a query language, as the DML or a subset of
the DML, that is:
- high level - express requests simply
- efficient - lends itself to optimization
- application independent - work on any database
- Gemstone and Opal
Gemstone is an Object Oriented system which uses Opal
as the data language.
- OO verses Relational Models
Advantages of OO:
- User-defined data-types - efficient type-specific representations
- Inheritence
- Reusibility of code
- Pure ODBMS - better represents semantics of real-world right in the
datalevel itself without the need for an extra layer
- Object Extended RDBMS - represents the semantics with
an extra layer mapping from the database to a
semantically rich data model
Disdvantages of OO:
- OIDs - we lose multi-values keys
- Lack of declarative integrity support (procedural only)
Extending the Relational model: e.g. have a student with a
name, address, academic_record, photo etc. Traditionally would store name
and address as strings with the academic_record and photo
as blogs. Would be better to produce ADT's for each of these.
- The ODMG Standard
The Object Data Management Group (ODMG) describes
three types of OO language: ODL (Object Data Language);
OML (Object Manipulation Language); OQL (Object Query
Language)
return to top
I decided not to study for this section as I didn't feel the
lectures gave me a good enough understanding of it.
return to top
- Data Mining verses OLAP
OLAP (on-line analytical processing) tests relationships between data
that someone has hypothesized (hypothesis testing), whereas data mining
finds relationships between data (hypothesis generation).
Data mining is involved in the description and prediction
of data.
A model is a general was of describing a dataset, e.g. y=mx+c.
A pattern is an instance of a model, e.g. y=2x+3.
- Segmentation
- Classification
Know the groups a priori. Supervised learning.
- Using neural networks
Feed-forward backpropagation networks. If no hidden layer then can
only perform linear regression. Use a test set to determine when
to stop training to prevent overfitting.
Advantages:
- parallel processing is easily used
Disadvantages:
- black box predictor (people want to know why)
- categorical explosion: use one input unit for each category
- can find false patterns if signal:noise ratio is low
Works best when:
- data set is large
- signal:noise ratio is high
- Using decision trees
Iterative splitting of data into discrete groups with maximal
distance between each group (e.g. as determined by chi-sqrd).
Decision nodes; leaf nodes; binary trees; multiway trees.
- classification trees - predict categorical data
- regression trees - predict continuous data
Advantages:
- can explain its predictions
- make relatively few passes through the data
- work well with many predictor variables
- good for large data sets
Disadvantages:
- interpretation may be difficult: different trees can give the
same outcome
Overfitting is also a problem for decision trees. We can use
stopping rules (e.g. limit the maximum depth, limit the minimum
number of records per node) or pruning (heuristics or user
intervention).
- Clustering
Discover the groups. Unsupervised learning. Find groups that are very
different from each other but whose members are all very similar.
- K-means
- Data types:
- continuous data
- categorical data
- ordinal - ordered, BUT distance between
points not guaranteed e.g. satisfaction rating
- nominal - e.g. poisonous/nonpoisonous,
post-codes, car make, colours (sometimes continuous)
- e.g. K-3 algorithm:
- pick 3 random points as centroids
- assign each remaining point to its closest
centroid - results in 3 clusters
- recalculate best centroid for each cluster
- repeat 2-3 till centroids don't move
Sensitive to choice of initial centroids so: pick
first centroid at random, second
to be furtherest in dataset from this point, and
third furtherest from these two points.
- Agglomerative clustering
Good for getting an idea of how many clusters we
should have. Basic algorithm:
- each data point begins in its own group: a singleton
- compare all pairs of groups and mark the group that is closest
- if distance between these two groups is <
some threshold then merge the two groups
- else done
If the threshold value is too low then may have too
many groups at the end, many singletons.
But is threshold is too large will get clusters
containing dissimilar items.
- K-medoids
The problem with K-means and agglomerative clustering
is that the centroids are not
necessarily meaningful for categorical data. So can
use, e.g. PAM (Partitioning Around
Medoids). Here we recalculate the medoid by swapping, BUT not very efficient.
- Dendrogram
For display of clustering results:
- Link Analysis
- Association discovery
| A | → | B |
| antecedent | | consequent |
| e.g. shopping basket: |
buy alcohol | | buy burger |
e.g. using the shopping basket: {(a,b,c),(a,c,d,f),(a,c),(b),(b,c)}
- support/prevalence
| supp(A) | = | Pr(A) |
| | = | number of
times A appears in table [/ num rows in table] |
| e.g. | supp(a) | = |
3/5 = 0.6 |
| e.g. | supp(a→b) | = |
1/5 = 0.2 |
- confidence
| Confidence(A→B) | = |
supp(AB)/supp(A) |
| e.g. | Confidence(a→b) | = |
1/3 |
| e.g. | Confidence(a→c) | = |
3/3 = 1 |
Low support thresholds leads to a rule explosion,
BUT lowe support plus high confidence
is important sometimes, e.g. diagnosis of rare diseases.
- lift
| Lift(A→B) | = |
Confidence(A→B)/supp(B) |
| e.g. | Lift(a→c) | = |
1/(4/5) = 5/4 |
Once have a good rule, lift is useful to calculate to determine if the
antecedent is a sufficient indicator of the consequent.
If confidence is high (so the rule is good) but lift is much bigger than
the confidence then to predict the consequent we
would need more such rules.
e.g. Confidence(a→c) is high(1) but
Lift(a→c) = 5/4 > Confidence(a→c)
- Sequence discovery
Association related over time.
- Decision-Tree Classification using SPRINT
- Overview
Attributes:
- categorical
- continuous
- classifying attribute = class
Each non-leaf node is a split point.
- growth phase - till each partition pure or small enough
- prune phase - <1% of computation time
- Partitioning
- The partition algorithm
Partition(Data data)
if data is pure (i.e. all records are in the same class)
return
for each attribute A
evaluate the best split point on attribute A
using the best split point, split data into datasets 1 and 2
Partition(dataset1)
Partition(dataset2)
- Continuous data
- All attrubute lists are sorted.
- To find the best split point on an attribute,
start at the beginning of the
attribute list and use two arrays (done and notDone)
which represent the counts of the classes before and
after the current cursor
position.
- Categorical data
Instead of entire attribute lists to scroll through
we can just calculate a single count matrix
for each category type. Here the rows are the
categories for the category type (e.g. for car
type, the rows may be: station_wagon, hatch_back,
sports, family etc) and the columns are for
the classes. We then recursively determine which
category value should be split on.
- Calculation of best split point
For two attrubutes we use the values for one of the
classes in these arrays to calculate the
best split point (that with the lowest
ginisplit) for
spitting the dataset d into datasets
d1 and
d2:
where n1 and n2
are counts of the number of items in this
class in datasets d1 and
d2 respectively, and
where p is the relative frequency of the
chosen class j in dataset d.
- Performing the split
Partition the attribute list of the winning
attribute, using a list of the
rid's that will go on one side (either always the LHS,
or calculate the side that will
hold the least number of records).
- Advantages of SPRINT over SLIQ
SLIQ-R - replicated class list. SLIQ-D - distributed
class list - performed worst.
SPRINT uses less memory than SLIQ.
- removes most memory restrictions
- fast
- scalable (c.f. SLIQ thrashes above a certain number of records
- paralletlizable
- Disadvantages
Is a greedy algorithm:
- it minimises diversity at each split, but locally
only...not look-ahead
- tends to stop at local minima - may consider using
NN's and simulated annealing
- Pruning
Want to prune if data overfitted or the tree is just
too big. Use a parameter t.
If t is large then get a smaller tree; if t is small then
get a larger tree. Either dial
t up and down, or calculate t when we chose to cut at
particular notes,
then test each pruned tree on the test data and
determine which is the best
t value. i.e. trial and error, and can lead to rules of thumb.
MDL (minimum description length) = size(tree) +
size(misclassifications(tree)).
Only grow if description length is reduced - so can be
used during splitting.
- Parallelizing the Classification
Each attribute list split over N processors.
- continuous data is sorted first then each processor gets
a contiguous section and so the initial countdone
depends on previous processors, and the initial
countnotDone is calculated from this and subsequent
processors.
- categorical data: local count matrix → global count matrix, and
coordinator processor evaluates the ginisplit
The hashtable/list of rids for splits needs to be
shared/communicated between all
processors.
- Association Mining
- Association Rules
Let I = a set of items.
Then A → B where A I and
B I and A
B = {}
- holds with confidence c if c% of transactions (rows) in the database
that contain A also contain B.
- has support s if s% of transactions (rows) in the database contain
A
B
- Subsets
For a set with n items, there are 2n subsets.
e.g. {1,2,3,4,5,6}, possible subsets = [{},{1},...,{6},
{1,2},...,{1,6},{2,3},...,{2,6},...]
| 1 | 2 | 3 | 4 | 5 | 6 |
| {} | 0 | 0 | 0 | 0 | 0 | 0 |
| {1} | 1 | 0 | 0 | 0 | 0 | 0 |
| {2} | 0 | 1 | 0 | 0 | 0 | 0 |
| etc. |
- Apriori algorithm
- Algorithm
L1 = {large (count >= min_support) 1-itemsets}
for (k = 1; Lk != {}; k++)
-
| Ck+1 | = |
join Lk with Lk
(each itemset with first k-1 items, the
prefix, in common and last item
not in common) then prune out ck+1
where any subset of this made by removing one item
is not in Lk |
| for each transaction in the
database |
| add to count of c Ck+1 if the transaction
contains this subset c |
| Lk+1 | = |
Ck+1 only keeping those whose count
is > min_support |
Really expensive as everything needs to calculate support
for items in the candidate set, Ck,
must scan the entire database.
- Example
e.g. transaction records(as <tid, {items}>) =
{<100, {1,2,3,4}>, <200, {1,3,4,5,6}>, <300,
{2,3,4}>, <400, {3,5,6,7}>, <500,
{1,3,4,6}>}, and let min_support = 3 and min_confidence
= 50%
| 1-itemsets |
2-itemsets |
3-itemsets |
| item | support |
items | support |
items | support |
| {1} | 3 |
{1,2} | 1 |
{1,3,4} | 3 |
| {2} | 2 |
{1,3} | 3 |
{1,3,6} | 2 |
| {3} | 5 |
{1,4} | 3 |
{1,4,6} | 2 |
| {4} | 4 |
{1,5} | 1 |
{2,3,4} | 2 |
| {5} | 2 |
{1,6} | 2 |
{3,4,5} | |
| {6} | 3 |
{2,3} | 2 |
{3,4,6} | 2 |
| {7} | 1 |
{2,4} | 2 |
{3,5,6} | 2 |
|
{2,5} | 0 |
|
|
{2,6} | 0 |
|
|
{3,4} | 4 |
|
|
{3,5} | 2 |
|
|
{3,6} | 3 |
|
|
{4,5} | 1 |
|
|
{4,6} | 2 |
|
|
{5,6} | 2 |
|
Thus, L1 = {{1}, {2}, {3}, {4}, {5}, {6}} since set
{7} does have enough support (must be >= 2).
L2 = {{1,3}, {1,4}, {1,6}, {2,3}, {2,4}, {3,4},
{3,5}, {3,6}, {4,6}, {5,6}} since only these subsets have
enough support in the transaction data. Subset {3,4,5} was
eliminated from C3 because removing item 3
would have resulted in subset {4,5} which is not in
L2. Thus, L3 = {{1,3,4}, {1,3,6},
{1,4,6}, {2,3,4}, {3,4,6}, {3,5,6}}. Only the first two
subsets of these can be combined (since they both have the
prefix {1,3} in common) to give set {1,3,4,6}. All of
the subsets of this set produced by removing a single item
(1 or 3 or 4 or 6) results in a subset that is present in
L3. Furthermore, this set also has a support
of 2 and so L4 = {{1,3,4,6}}. Thus out of a
possible 27=128 subsets, we only generated 30,
with 23 having minimum support.
- Generation of rules
Taking the only large 4-itemset we can partition the items
in the set between the antecedent and consequent of a rule
and calculate suport. In practice we only generate those
rules where the consequent has one item (and if this item
was the class then we are building a classifier).
e.g. supp({1,3,4,6}) = 2; confidence=supp(AB)/supp(A)
| A → B | supp(A) | confidence |
| 3,4,6→1 | 2 | 100% |
| 1,4,6→3 | 2 | 100% |
| 1,3,6→4 | 2 | 100% |
| 1,3,4→6 | 3 | 67% |
- AprioriTid algorithm
The database is not used to calculate support after the
first pass. Rather we store the required information in a
set Cbark in the form <tid,
{Xk}>, where Xk is the set of potentially
large k-itemsets that are present in the
transaction. Ck+1 is still computed from
Lk, but Cbark+1 is computed from
Cbark and Ck+1. The advantages of
AprioriTid only eventuate when Cbark can fit
into memory but the entire database cannot.
- Apriori-hybrid
Start by using the Apriori algorithm until
Cbark could fit into memory, then switch to
AprioriTid. However, there is a cost associated with this
switch and so if we are just at the end of the search then
this can be a waste.
- Rare items
Very low support but very high confidence. Not
supported by these algorithms. May be useful for
e.g. diagnosis of rare diseases; expensive items in a
store (could give these a high weight and use a
weighted count to determine support).
- Lift
Lift(A→B) = Pr(AB)/(Pr(A)Pr(B)). Density of attribute/class
relative to its density in the entire database.
e.g. 1,3,4→6. Pr(6) in db = 3/6; Pr(1,3,4) in
db = 3/6; Pr(1,3,4,6) = 2/6; Lift = 2/6 / (3/6 * 3/6) =
12/9 = 4/3. Not very interesting then
return to top
- Impetus for data warehousing
- Decision support system
- tool(s) which allows business end users to perform
computer generated data analysis on their own; includes
reporting, OLAP,... ; aim for fast
and accurate retrieval
- OLTP
- on-line transaction processing; needs fast throughput
If want to test a hypothesis against the database, don't
want to interrupt the OLTP as this needs fast throughput, AND
if want OLAP to be fast so transaction data would be too slow
to go through, so summarize first
- data warehouse
- a copy of transaction data specifically structured for
querying and reporting
- Generation of warehouse summaries
- passive/lazy
- only gets data after a query requiring it has arrived
- active/eager
- gets data as it is produced, or periodically, and before
queries for it arrive; the approach used by data warehouses
- Architecture
Architecture of the Stanford data warehousing project
(WHIPS)
compilation component - user-defined content of the data
warehouse (monitors, integrator) - want to automate this too
- Monitors
How to tell when new data/change in information source:
- triggers in information source notifies monitor
- every application that changes the information source also
notifies the monitor (not such a great idea as may be a lot of
these apps)
- use periodical source data dumps into a file - already
have this facility for backing up, but get the snapshot
differential problem
- Snapshot differential problem
Two snapshots of the database as flat files: F1 at
day 1, and F2 at day 2:
- rows = <K, B> (where K = key)
- compare files to get a stream of change notifications
updated (Ki, B1, B2)
deleted (Ki, B1)
inserted (Ki, B2)
- problem: rows with the same key may appear in
different places in F1 and F2
- not always that bad, e.g. sliding window algorithm
finds K1 in F1 and looks in first N
rows in F2, doesn't find the key so creates a
deleted notification. Later finds and
creates an inserted notification. End
result: K1 is still in the DW so no inconsistencies.
- if B very large then hash to small signature - may
miss a few updates but probability is low
- problem is continuous so when comparing F1
and F2 to make the change notifications, also
create an auxiliary file for F2 to make the
next comparison easier
- sort-merge join, hash join, UNIX diff
- Integrator
- rule-based
- each rule handles one type of change notification
- may need additional data from DW, especially if DW is of
average or summary data (e.g. average salary and have updated
the salary of an employee)
- distributed object system: WH, integrator, monitors,
information sources can all be on different machines and
programmed in different languages
- Warehouse update anomolies
In a normal database we have ACID transactions where tables
are locked preventing update anomolies. However in a DW the
data is remote and we don't always have the luxury of table
locking (sometimes we might if all the datasources are in the
same timezone and we can do it in the wee hours of the morning
when they're not being used).
If integrator issues a query back to an information source
(since may need more information for an update) but the query
is issued after a modification in the source so source may
have changed since then. Solution? recompute the view?
| table T1 |
table T2 |
Data warehouse |
|
|
w(T1
T2) = [1] i.e. project w from the
natural join (on x, to give <x,y,w>) of the two
tables |
|
update(2,3)
|
update notification: u(T2[2,3])
now WH sends query to T1:
Query(T1 [2,3]) |
insert(4,2)
|
|
T1 returns [1][4] to query.
also, update notification:
i(T1[4,2])
so WH sends query to T2:
Query([4,2]
T2)
T2 returns [4] to query.
So, in DW we now have [1][4][4] where we should just
have [1][4] |
ECA (Eager Compensating Algorithm) - queue of queries at
unanswered end; only works when all tables in one datasource
- Data reconciliation
Integrator must also reconcile data from different schemas.
e.g. database 1: sex:0|1, database 2: sex:M|F, ...
- Advantages
- DW queries don't hold up the daily transactions on the
businesses online/live databases (OLTP)
- can construct the DW to speed up the querying/reporting process
- can design the DW to simplify the querying/reporting process
- can chose to store only cleaned up transactional data
without having to fix the OLTP systems
- querying data from multiple, disparate sources
- DW can hold older (archival, historic) data too that may have been purged from
the online database for speed
- security: if DW holds only summary information
- Disadvantages
- extra storage space since transaction data is copied, BUT
transaction data can be summarized first
- not so good if absolutely current data is required
- need to define data sources in advance so not so good if
client has unpredictable needs
- construction and maintenance
return to top
- Introduction
- One-dimensional indexing: B-trees
Designed with minimising disk accesses in mind: ms verses
ns to read from disk than to read from RAM so want to
minimise the number of disk hits.
B+-trees - actual data is at the leaf nodes
only.
B and B+-trees - all leaves at same level; only
ever grow at the root (and shrink at the root furing
delteions): an overfull node is split and this change is
propagated back up through the tree; sometimes on way down
during an insert, if come across a full node may split in
anticipation
- R-Trees
Previously, indexing on multi-dimensional data just used a
different index for each attribute, then had to do joins.
R-Trees are B+-trees extended to hold a n-dimensional
objects. They allow indexing by multi-dimentional spatial
intervals.
- R-Trees
m = minimum number of entries in a node; M = maximum
number of entries in a node; m <=M/2
- Search
Search(S) - for all items whose indices overlap S (can
also alter to do "are contained by S" or "contain S").
Most operations on a database are searches - this is the
one we really want to be fast.
Search(S)
Set T = root node
Until done
if T is not a leaf
search subtrees whose indices overlap S
(i.e. for each that does, set T = new child node)
else if T is a leaf
check each index
if overlap, add to list of qualifying records
done
- Insert
Insert(E)
ChooseLeaf -> L
if num entries in L < M
install E
else
SplitNode(L+E) -> L, LL
AdjustTree(L) or AdjustTree(L,LL)
if root split
put as children of new root
ChooseLeaf
N = root node
repeat
if N is leaf, return N
else
choose F in N such that index, FI, would require least
enlargement to accomodate E (if tie, choose F with the
smallest bounding rectangle, FI)
N = FP
SplitNode
split in such a way as to minimise sum of two covering rectangles
after split; SplitNode algorithms: Exhaustive, Quadratic
and Linear; see Splitting below
AdjustTree(L[,LL])
N = L
(NN = LL)
repeat til done
if N is root, done
else
P = parentOf(N)
En is N's entry in P
adjust EnI to tightly enclose all entry I rects in N
if N has partner NN
create new Enn
set EnnP to point to NN
and EnnI to enclose all rects in NN
if room in P
add Enn to P
else
SplitNode(P+Enn) -> P,PP
N = P
(NN = PP)
- Splitting
- Exhaustive
2n-1 non-empty subsets, so this many comparisons
- Quadratic
- PickInitPair - choose most wasteful: i.e. pair of
items whose bounding rectangle is covering the most
space excluding that covered by the two items (takes n(n-1)/2
calculations)
- PickNext - for each item, calculate increase in
Area of adding item to each group and calculate diff =
|A1 - A2|, choose item with the
biggest differenct (again n2 calculations)
- Linear
- Linear PickPair - along each dimension calculate sepbar =
sep/d, where sep = highest low side - lowest high
side; d=range; (only 2n comparisons); then take the
pair with the greatest sepbar; e.g. below,
sep2/range2 is the greatest so
would take nodes N1 and N4
- Delete
Delete(E)
FindLeaf(E) -> L
if L is null, return
delete E from L
CondenseTree(L)
if root has only 1 child
new root = child
FindLeaf(E)
N = root node
repeat
if N is not a leaf
for each F in N
if FI overlaps EI
N = F
else
for each F in N
if FI matches EI
return N
CondenseTree(L)
N = L
Q = {} (list of eliminated nodes)
repeat
if N is root
Reinsert(Q)
else
P = parent of N
En is N's entry in P
if <m entries in N (**see later)
delete En from P
add N to set Q
else
adjust EnI to tightly contain all entries in N
N = P
Reinsert(Q)
for each node, N, in Q
if N is a leaf node
Insert(N)
else
insert node at correct level
** Dealing with underfull nodes
- compare with B-trees - merge with a sibling or
orphaned nodes distributed amoung siblings
- So why choose to reinsert?
- does the same thing and we can use the already
created Insert method, and the efficiency is
comparable as the pages required for reinsertion are
usually the same ones used in the search and so are
still in memory
- indrementally refines structure preventing gradual
deteriorations
- R*-Trees
- Main Contributions
- Forced reinsert of p entries when have overflow
- this single change gives the best performance
enhancement
- Different parameters to be optimised
- Area (Glutman's R-Trees)
- Margin (Perimeter)
- Overlap
- Utilisation (Storage)
- R-trees with larger m (minFill) therefore
each node is closer to MaxFill
- m ≈ 40% M as optimum - they set m early
on and played with a-c
- BUT this conflicts with a-c
- Insert
Insert(E)
ChooseLeaf -> L
if num entries in L < M
install E
else
if L is the root or ReInsert already performed
Split
create new root if necessary
else
ReInsert(L+E)
AdjustTree(L)
if root split
put as children of new root
ChooseLeaf
N = root node
repeat
if N is leaf, return N
else
if the nodes in N point to leaves
(minmise the overlap cost)
choose F in N such that rectangle, FI, needs the least
overlap enlargement to include the new data (if tie,
choose F with the smallest Area enlargement)
else if the nodes in N point to other nodes
(minimise the Area cost - same as before)
choose F in N such that index, FI, would require least
enlargement to accomodate E (if tie, choose F with the
smallest bounding rectangle, FI)
N = FP
- Point Queries

non-deterministic - depends on the order in which the
elements were inserted
- Area Queries
Therefore as the size of the query increases it should be
better to minimise the perimeter total. So, if doing
mainly point queries, minimise area, and if doing mainly
area/range/spatial queries minimise perimeter.
We prefer square rectangles because this is the largest
amount of area for the smallest perimeter
- Packed R-Trees
- Preprocessing
Packing algorithms are used when you have alot of initial static data
with which to create your R-tree. In this way you can sort all
your data and then pack it into the leaves, then build from the
bottom-up the rest of the R-tree.
- Preprocess data file of N objects into ciel(N/M) groups
of M rectangles (the last group may not be full).
- Load each group into a page and output the <MBR, page_number>
(MBR is the minimum bounding rectangle).
- Recursively pack these MBR's. Use the page_numbers for pointers
from parents.
The only difference between the three packing algorithms below is how
you order the rectangles for putting into the leaves for the first step.
- Hilbert space-filling
- Uses how far along the Hilbert fractal line each rectangle's centre is.
- Hilbert fractals of order 1, 2, 3 and 4 are shown below:
- Nearest-X algorithm
- very simple - all rectangles ordered by the x-coordinate of their centre
- creates long skinny bounding rectangles, therefore perimeter not optimised
and so not so efficient for range queries (OK for point queries though)
- STR (Sort-Tile-Recursive) algorithm
- First, determine how many nodes will be needed: num_nodes = ciel(N/M).
e.g. if we have 100 rectangles/objects and our max_fill = M = 8, then we
will needs a total of ciel(100/8) = 13 nodes (leaves in this case as
we are doing the bottom level of the R-tree).
- Then, since we want to tile these nodes into an x by x matrix, we first calculate
` x = ciel(sqrt(num_nodes)). We start with x vertical strips,
so we calculate how many rectangles
go into each strip, v_fill = M*x. e.g. x = ciel(sqrt(13))=4, so we split
the data up along the x-axis into 4 groups. In each strip there
will be v_fill = 4*8 = 32 rectangles.
- In each vertical strip, beginning at the top and working your way down (i.e. sorting on y-value),
put M rectangles into each tile/node. e.g. In our first vertical strip we
place 8 rectangles, in the second 8 and so on. However, in the last vertical
strip there will be a total of only 4 rectangles, so these all go into the
same node.
- Using the MBR's from the leaf nodes, recursively tile your way up the R-tree till
you reach the root node. e.g. We will need a total of ciel(13/8)=2 nodes
for the next node up, and the next node above that we will only need 1 node so that
is the root node.
The STR authors used small buffer (LRU: least recently used) sizes for
testing so they could mimic large data collections.
They found no clear winner between the packing algorithms, although
Nearest-X is only good for point queries, and STR is simpler than Hilbert space-filling (HS).
Their statistics were flawed as they assumed a normal distribution when
calculating how well STR had performed over HS.
return to top
|