COSC 454 notes: Database Theory and Applications

return to content page/computer science home

Topics:

  1. Relational Theory
  2. Oracle Database Administration
  3. Temporal Databases
  4. Non-relational Data Models
  5. Distributed Databases
  6. Data Mining
  7. Data Warehousing
  8. Multi-dimensional Indexing

  1. Relational Theory
    1. The ANSI/SPARC Architecture
    2. ANSI/SPARC Architecture

    3. Relations
      • relation scheme R = {A1,...,An} (list of attribute names)
      • for each Ai, there is a domain of Ai = Di = dom(Ai) (set of possible values for that attribute)
      • a relation r on a relation scheme R:
          r = {t1,...,tp} (a tuple): t(Ai) Di

    4. Functional Dependencies
      • X → Y: attribute Y of R is functionally dependent on attribute X of R iff for each X-value, there is only one legal Y-value
      • e.g. {Person, time, date} → Place

    5. Heath's Theorem: Lossless decomposition
    6. R[A,B,C] and A → B
      then can get a lossless decomposition of R into R1[A,B] and R2[A,C]
      (An example of a lossy decomposition would be R1[C,A] and R2[C,B])

    7. Testing for a Lossless/Non-additive Join
    8. Given R → R1,R2,...,Rn
      want to check that R1R2 ...Rn
      Algorithm
      1. Create matrix: Ri × Aj, and initialize each cell to bij
        e.g. R1={A,B}, R2={A,C} and F={A→B}
        j=1j=2j=3
        ABC
        i=1R1 b11 b12 b13
        i=2R2 b21 b22 b23

      2. For each row, Ri:
            for each attribute AjRi, set cell to aij
        e.g.

        R1={A,B}
        ABC
        R1 a11 a12 b13
        R2 b21 b22 b23

        R2={A,C}
        ABC
        R1 a11 a12 b13
        R2 a21 b22 a23

      3. Repeat till no change:
        • for each FD, X → Y F:
          • for all rows which have the same values (a's or b's), make their Y columns the same:
            • if at least one is an a, set all to aj
            • else choose an i and set all to bij
        e.g. F={A→B}
        ABC
        R1 a11 a12 b13
        R2 a21 b22 a23

        A→B
        ABC
        R1 a11 a12 b13
        R2 a21 a12 a23

      4. If there exists a row that has all a's in it then the join is lossless (non-additive)

    9. Inference Axioms
      1. Maier's List
        1. Reflexivity
        2. Augmentation
        3. Additivity/Union
        4. Projectivity/Decomposition
        5. Transitivity
        6. Pseudotransitivity
        • X→X
        • X→Y implies XZ→Y (or XZ →YZ)
        • X→Y and X→Z implies X→YZ
        • X→YZ implies X→Y
        • X→Y and Y→Z implies X→Z
        • X→Y and YZ→W implies XZ→W

      2. Armstrong's Axioms
      3. 1, 2 and 6 above: Reflexivity, Augmentation and Pseudotransitivity
        sound and complete

    10. Closure of a Set of Dependencies, F+
      • F+ = closure on a set of dependencies, F
            = all the FDs implied by F
      • includes trivial dependencies, e.g. A→A etc.
      • if F and G are FDs on a relation R and F+ = G+
        then F≡G and F is cover for G and G is cover for F
      • Maximum number of FDs that a relation R can satisfy is (2n-1)2 since for a FD such as X→Y, both X and Y can be any combination of n attributes

    11. Minimal Cover
    12. G is minimal cover (an irreducible set) of F if G is a covering set of the FDs in F and:
      • every RHS of the FDs in G is a single attribute
      • every LHS is irreducible (no redundencies)
      • no FD in G is redundent

    13. Closure of a Set of Attributes, A+
      1. Calculation of closure of a set of attributes, A
      2. Algorithm
        • oldA = {}
        • newA = A
        • while oldA != newA
          • oldA = newA
          • for each FD, X→Y, in F
            • if X newA
              • newA = newA Y
        • return newA

      3. Does a set of FDs, F imply X→Y?
      4. Compute X+ (the closure of X on F)
        if Y X+ then true
        else false

    14. Keys
      1. superkey
      2. X is a superkey of R if:
            X → A1A2...An F+

      3. candidate key
      4. X is a candidate key of R if:
        • X is a superkey of R and
        • for no proper subset Y X is Y → A1A2...An F+ (i.e. cannot discard a single A in X to leave a superkey)
        To determine if a set of attributes, A, can be a candidate key of a relation, R, given a set of FDs, F:
        • calculate A+ using F
        • if A+ includes all the attributes in R then yes, A is a candidate key

    15. Preservation of Dependencies
    16. F is a set of FDs for a relational scheme R
      F+ is the closure of F
      Π(Z)(F) = projection of F onto a set of attributes Z
      = dependencies X → Y in F+: XY Z
      Therefore, to test preservation of dependencies when splitting R into R1, R2, etc.
      1. calculate F+
      2. calculate F1 = Π(R1)(F), F2 = Π(R2)(F), etc.
      3. Is F+ = F1 F2...?

    17. Normal Forms
      1. N1NF (Non First Normal Form)
      2. e.g. {name,email}={("Alice","alice@orcon, ahol@cs")}

      3. 1NF (First Normal Form)
      4. A relation is in 1NF if the domains of all attributes contain only atomic values and the value of any attribute in a tuple is a single value from the domain.
        e.g.1. {name,email}={("Alice","alice@orcon"),("Alice","ahol@cs")}
        e.g.2. R[ABCDE], F={A→C,B→E,AB→D}, CK=AB

      5. 2NF (Second Normal Form)
      6. A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the primary key (i.e. no partial dependencies: ).
        e.g. R[ABCD], F={AB→C,AB→D,C→D},CK=AB

      7. 3NF (Third Normal Form)
      8. A relation is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on any key, or in other words:
        A relation R is in 3NF if, for every non-trivial FD, X→A either
        1. X is a superkey of R, or
        2. A is a prime attribute of R
        e.g. R[SCP], F={SC→P,P→C}, CK=SC

      9. BCNF (Boyce-Codd Normal Form)
      10. A relation is in BCNF if it is in 3NF and every determinant (X→Y, X is the determinant) is a candidate key, or in other words:
        A relation R is in BCNF if it is in 2NF and, for every non-trivial FD, X→A
        • X is a superkey of R
        e.g. R[ABCD], F={A→B,B→A,A→C,A→D,B→D}, CK=A|B

      11. 4NF (Fourth Normal Form)
      12. Multi-valued dependencies (MVD) have been removed.
        A→→B (A multidetermines B) iff whenever there are two tuples, say t1 and t2, where the values for A are the same, but B and C may be different, then there will exist two other tuples, t3 and t4, that give the two different combinations of B and C values:
        ABC
        t1 a1 b1 c1
        t2 a1 b2 c2
        t3 a1 b1 c2
        t4 a1 b2 c1
        And we write A→→B or equivalently A→→C, or A→→B|C

      13. 5NF or PJNF (Fifth Normal Form or Project Join Normal Form)
      14. Every join dependency in R is a consequence of the candidate keys of R.

    18. Normalisation
      1. Point of Normalisation
      2. Remove certain kinds of redundencies, avoid certain update anomalies, to produce an intuitive representation of the real world and to simplify the enforcement of integrity constraints.

      3. Decomposition into 3NF
        1. begin with a relation scheme R and F, a minimal cover set of dependencies
        2. remove attributes in R not involved in F to a separate scheme
        3. for each dependency, X→A in F, create a relation scheme [XA] (note, if have X→A1,X→A2 etc., combine in a single scheme: [XA1A2...])

      4. Decomposition into BCNF
      5. For each functional dependency, X→A where X is not a superkey of R, separate into the relations R1[XA] and R2[R-A]. R2[R-A] may not be in BCNF so may have to repeat.

      6. Decomposition into 4NF
      7. Remove multi-valued dependencies: if R[ABC] and A→→B|C then split up into R1[AB] and R2[AC]

    return to top


  2. Oracle Database Administration
    1. An Oracle Server
      1. An Oracle Database
      2. Data files, redo log files, security control files, a parameter/init file, trace files, an alert log.
        ANSI/SPARC Architecture

      3. An Oracle Instance
      4. The dynamic part: Processes (both SYSTEM and user). Consists of a System Global Area (SGA - includes a DB buffer cache and a Redo log buffer), server processes produced to handle user requests, and a set of background processes (includes a database writer, a log writer, checkpoint, system and process monitors, an archiver, recoverer processes, dispatchers and lock processes).

    2. Role of the DBA
    3. A data administrator makes the strategic and policy decisions about the data, the DBA implements these.
      1. Defines both the conceptual schema (tables, views, indices, synonyms etc.) and the internal schema (Tablespaces - Oracle will automatically create a SYSTEM tablespace, at least one user tablespace is also required), and the mapping between the two.
      2. Liase with users.
      3. Defines the security and integrity constraints.
      4. Defining dump and reload policies: recovery.
      5. Monitoring performance and responding to changing requirements (e.g. new user, add column to a table, etc.)
      Note: Oracle produces a few tables automatically: all_tables, dba_users...

    4. Physical Storage in Oracle
    5. Data block (logical block, page, Oracle block) - fixed size
      Extent - specific number of contiguous blocks
      Segment - a set of extents used for a certain logical structure, e.g. data segment, index segment, rollback segment...

    6. Using the Server Manager
      1. Commands available
      2. As DBA, use the server manager (svrmgrl to run the command line version) to administer the database.
        CONNECT|DISCONNECT First thing to do:connect internal, can also create a dummy user and connect as them to check how things are going, or connect as sysdba or SYSTEM
        STARTUP|SHUTDOWN the oracle instance - need to keep this running for your users, to check:
        ps auwx | grep -i ora
        ARCHIVE LOG start | stop | list | next | all | 'destination'
        RECOVER database [manual] | tablespace tsname[,tsname2]
        SET|SHOW instance | echo | termout | timing | numwidth | charwidth | longwidth | datewidth | autoprint, in addition SHOW has all | spool
        EXITfrom svrmgrl
        and SQL statements (note: to run a set of SQL statement from file type: @filename.sql).

      3. Users
        1. Creating a new user
        2. create user username identified by password default tablespace tbsname temporary tablespace tbstempname;
        3. Changing a user's password
        4. alter user username identified by newpassword;

      4. System Privileges
        1. Granting
        2. grant [connect][resource] to username;
          Connect just allows a user to login, resource allows them to build their own tables etc. Usually a user will need both.
        3. Revoking
        4. revoke [connect][resource] from username;
        5. Checking
        6. select * from dba_role_privs [where grantee = 'username'];

      5. Object Privileges
        1. Granting
        2. grant [select][insert][delete][all] on owner.tablename to username|rolename|PUBLIC [with grant option];
        3. Revoking
        4. revoke [select][insert][delete][all] on owner.tablename from username|rolename|PUBLIC;
        5. Checking
        6. select * from dba_tab_privs where grantee = 'username';

      6. Synonyms and Object Ownership
        1. Checking Object Ownership
        2. select owner, object_name, object_type from dba_objects where [grantee = 'username'] | [object_name = 'tablename'];
          Determines either what a particular user owns, or who owns a particular table.
        3. Creating
        4. To avoid users having to know the owner of a table they have object privileges on, we can create synonyms:
              create public synonym syntablename for owner.tablename;
        5. Dropping
        6. drop public synonym syntablename;

      7. Roles
        1. Creating
        2. To simplify the granting/revoking of privileges we can create roles. Then users are granted roles and roles are granted priveleges (just like users are).
              create role rolename;
        3. Applying to users
        4. grant rolename to username;
        5. Altering or dropping a role
        6. Altering a role can be quite messy. It may be easier to simply drop a role and then create it again.     drop role rolename;

    7. Backing Up and Recovery
    8. Regular backups shoule be made. This can be done by exporting the whole database on a regular basis to a flat .dmp file (using the exp command). Recovery can be accomplished by importing (using the imp command) the backed up file. In linux, can use crontab to automatically run a bash script which does the backups.

    return to top


  3. Temporal Databases
    1. Temporal Concepts
    2. Most DBs simply overwrite old data when updating. Need a TBDMS (temporal database management system) and associated TSQL.

      1. Time, Granularity and the Chronon
      2. Time may be discrete or continuous, but in a database application it will be discrete. The granularity of time (e.g. nearest ms|s|min|h|d|y etc) is specific to an application. A chronon is the smallest unit of time used. An interval is a set of successive chronons. An event that occurs instantaneously will exist for one chronon.

      3. Time Dimensions
        1. Valid time
        2. also real-world time, intrinsic time, logical time, data time.
          The real-world time during which an object exists or an event takes place.
        3. Transaction time
        4. also registration time, extrinsic time, physical time.
          The DBMS time at which an event was entered into the temporal database.
        5. User-defined time
        6. A domain supported by the data model, as defined by the user, e.g. dob.

      4. Allen's Temporal Operators and TSQL2
      5. Used to describe temporal relationships between two time intervals, e.g. X and Y:
        Allen's operator TSQL2 operator
        X EQUALS Y
        XXX
        YYY
        
        =
        X BEFORE Y
        XXX
             YYY
        
        X PRECEDES Y
        X MEETS Y
        XXX
           YYY
        
        END(X) = BEGIN(Y)
        X OVERLAPS Y
        XXX
         YYY
        
        X OVERLAPS Y AND
        END(X) PRECEDES END(Y)
        X DURING Y
         XXX
        YYYYY
        
        BEGIN(X) PRECEDES BEGIN(Y) AND
        END(X) PRECEDES END(Y)
        X STARTS Y
        XXX
        YYYYYY
        
        BEGIN(X) = BEGIN(Y) AND
        END(X) PRECEDES END(Y)
        X FINISHES Y
           XXX
        YYYYYY
        
        BEGIN(X) PRECEDES BEGIN(Y) AND
        END(X) = END(Y)
        In TSQL2, to refer to the valid time interval of a table use:
            VALID(tablename)
        Thus, to refer to the specific v_start and v_end columns of a table use:
            BEGIN(VALID(tablename))
            END(VALID(tablename))
        Then can use the PRECEDES, =, OVERLAPS or CONTAINS operators to compare to another time interval (or PERIOD(starttime, endtime).

    3. Valid Time (Historical) Databases
    4. e.g. a dblog of when user sessions, so v_start is their login time and v_end is their logout time (before they logout this is set to the current time symbol: ).
      e.g. of a query: Who was logged in, and to which computer, at 11.05am?
          SELECT d.user, d.computer
            FROM dblog d
            WHERE VALID(d) CONTAINS '11.05';
      

    5. Bitemporal Databases
    6. Includes both valid time (v_start and v_end) and transactional time (t_start and t_end).
      Nothing is ever deleted (only current times can get pinned down).

      I think of these as belief revision databases where the transactional period is the period over which a belief is/was held. Thus it is obvious that two conflicting beliefs cannot be held concurrently, so if need to change belief at chronon c, then old belief must end at c-1 for new belief to begin at c.

      e.g.
      1. Monday: Three suppliers (S1, S2 and S3) contracted on Monday and loaded into the database.
        snumsnamecity v_startv_endt_startt_end
        S1SmithLondon Mon Mon
        S2JonesParis Mon Mon
        S3BlakeParis Mon Mon
        Consider the in the t_end column as denoting a current belief.
      2. Tuesday: Supplier S4 starts as a contractor, and DB is updated. Supplier S1's contract is cancelled as from today, and DB updated immediately.
        snumsnamecity v_startv_endt_startt_end
        S1SmithLondon MonMon Mon
        S2JonesParis Mon Mon
        S3BlakeParis Mon Mon
        S4ClarkLondon Tues Tues
      3. Wednesday: Supplier S3 moves to New York and change noted immediately. Supplier S57 starts.
        snumsnamecity v_startv_endt_startt_end
        S1SmithLondon MonMon Mon
        S2JonesParis Mon Mon
        S3BlakeParis MonTues Mon
        S4ClarkLondon Tues Tues
        S3BlakeNew York Wed Wed
        S57HeinzBerlin Wed Wed
        Note that the belief that S3 was in Paris between Mon and Tues inclusive has not changed so we donot need to change the t_end.
      4. Thursday: The auditor comes: S2 didn't actually start till Tues; S57 doesn't, and never did, exist; S1 is still in Paris and always has been.
        snumsnamecity v_startv_endt_startt_end
        S1SmithLondon MonMon MonWed
        S2JonesParis Mon MonWed
        S3BlakeParis MonTues Mon
        S4ClarkLondon Tues Tues
        S3BlakeNew York Wed Wed
        S57HeinzBerlin Wed WedWed
        S2JonesParis Tues Thurs
        S1SmithParis MonMon Thurs
      5. Friday: S1 is now trading as Smith International of Rome and is given a new contract.
        snumsnamecity v_startv_endt_startt_end
        S1SmithLondon MonMon MonWed
        S2JonesParis Mon MonWed
        S3BlakeParis MonTues Mon
        S4ClarkLondon Tues Tues
        S3BlakeNew York Wed Wed
        S57HeinzBerlin Wed WedWed
        S2JonesParis Tues Thurs
        S1SmithParis MonMon Thurs
        S1Smith IntlRome Fri Fri

    return to top


  4. Non-relational Data Models
    1. The Network Model
      1. The CODASYL Model
      2. The CODASYL Model
        DDL: Data Description Language, language independent in the Schema level but language compatible in the Sub-schema level
        DML: Data Markup Language, always language compatible
        Advantage: each language could define its own DDL and DML. But, language compatible DDL and DML only defined for COBAL.

      3. Bachman diagrams
      4. Data model objects:
        • record - group of related data values, a tuple
        • set - describes a 1:N relationship between two record types; has a name (type of relationship, e.g. major_dept), an owner record (e.g a department) and a linked list of member records (e.g. a list of students)
        A set type consists of a name, an owner record type and a member record type.
        e.g. Bachman diagram

      5. Sample Data
      6. Sample network data

        So, each record can have links to a parent/owner of a set (if it is the last member record in a set), to the first child in a list of member records of a set, or to the next member record in a list of a set. Also, each record could be involved in different sets - as members or owners.

      7. Data retrieval: Navigation
      8. Uses pointers to current set and record in that set. Like any linked list datatype, such a structure is fast for certain kinds of retrieval processes, e.g. iteration.

    2. The Hierarchical Model
      1. Architecture
      2. IBM's Information Maganagement System (IMS):
        The CODASYL Model

        Based on business models. Records and parent-child relationships (1:many, similar to sets in the network model except that there are no pointers back to the parent).

        Very efficient for many applications, but difficult to extend or alter once built (i.e. lacking in relational flexibility to add/remove tables).

      3. Hierarchical Schemas
      4. e.g. A Hierarchical Schema

      5. Sample Data
      6. Sample hierarchical data

    3. The Object Model
    4. The next data model!
      The Object Oriented Database System Manifesto lists 13 mandatory features:

      1. OO Mandatory Features
        1. Complex objects
        2. Made up from the simplest objects: integers, characters, bytes, strings, booleans and floats.

          Also need:

          • tuples - represent properties of an entity
          • sets - natural way of representing collections
          • lists - or arrays, to represent order

        3. Object identity
        4. An object's existence is independent of its value. Thus identical objects are the same objects, but can also have equal objects which are objects with the same value. Differnet objects can also share a common component so that one object can update the value of this shared component.

        5. Encapsulation
        6. Provides logical data independence: we can change how a type is implemented without needing to change the programs that access it.

        7. Types and classes
          • type - compile time checking; summarizes common features; abstract data type; interface and an implementation
          • class - run time manipulating; object factory creates new instances, object warehouse stores those instances; can operate on entire warehouse at once
        8. Class or type hierarchies
        9. Utilize inheritance, which reduces coding and helps reusibility

          e.g. Inheritance example
        10. Overriding, overloading and late binding
          • Overriding e.g. in Picture, Person and Graph object, each have a method called display()
          • Overloading e.g. display(Picture), display(Person) and display(Graph)
          • Or we can just have display(Object) and let the method itself work out the type of the Object and display it accordingly.
        11. Computational completeness
        12. Ability to express any computable funciton in the DML.

        13. Extensibility
        14. Ability to define new types, and no distinction in usage between system and user-defined types.

      2. DB Mandatory Features
        1. Persistence
        2. Should be implicit - i.e. user should not have to explicitly save the data to make it persist.

        3. Secondary storage management
        4. Hidden (from user) performance features: index management, data clustering and buffering, access path selection, query optimization. DBA may have some control in programming these.

        5. Concurrency
        6. Multiple users and queries, so must still support atomicity of a sequence of operations and controlled sharing using locks etc.

        7. Recovery
        8. Recovery from processor or disk failures.

        9. Ad hoc query facility
        10. Provide a query language, as the DML or a subset of the DML, that is:
          • high level - express requests simply
          • efficient - lends itself to optimization
          • application independent - work on any database

      3. Gemstone and Opal
      4. Gemstone is an Object Oriented system which uses Opal as the data language.

      5. OO verses Relational Models
      6. Advantages of OO:
        • User-defined data-types - efficient type-specific representations
        • Inheritence
        • Reusibility of code
        • Pure ODBMS - better represents semantics of real-world right in the datalevel itself without the need for an extra layer
        • Object Extended RDBMS - represents the semantics with an extra layer mapping from the database to a semantically rich data model
        Disdvantages of OO:
        • OIDs - we lose multi-values keys
        • Lack of declarative integrity support (procedural only)

        Extending the Relational model: e.g. have a student with a name, address, academic_record, photo etc. Traditionally would store name and address as strings with the academic_record and photo as blogs. Would be better to produce ADT's for each of these.

      7. The ODMG Standard
      8. The Object Data Management Group (ODMG) describes three types of OO language: ODL (Object Data Language); OML (Object Manipulation Language); OQL (Object Query Language)

    return to top


  5. Distributed Databases
  6. I decided not to study for this section as I didn't feel the lectures gave me a good enough understanding of it.

    return to top


  7. Data Mining
    1. Data Mining verses OLAP
    2. OLAP (on-line analytical processing) tests relationships between data that someone has hypothesized (hypothesis testing), whereas data mining finds relationships between data (hypothesis generation).

      Data mining is involved in the description and prediction of data.

      A model is a general was of describing a dataset, e.g. y=mx+c. A pattern is an instance of a model, e.g. y=2x+3.

    3. Segmentation
      1. Classification
      2. Know the groups a priori. Supervised learning.
        1. Using neural networks
        2. Feed-forward backpropagation networks. If no hidden layer then can only perform linear regression. Use a test set to determine when to stop training to prevent overfitting.

          Advantages:
          • parallel processing is easily used
          Disadvantages:
          • black box predictor (people want to know why)
          • categorical explosion: use one input unit for each category
          • can find false patterns if signal:noise ratio is low
          Works best when:
          • data set is large
          • signal:noise ratio is high
        3. Using decision trees
        4. Iterative splitting of data into discrete groups with maximal distance between each group (e.g. as determined by chi-sqrd). Decision nodes; leaf nodes; binary trees; multiway trees.

          1. classification trees - predict categorical data
          2. regression trees - predict continuous data
          Advantages:
          • can explain its predictions
          • make relatively few passes through the data
          • work well with many predictor variables
          • good for large data sets
          Disadvantages:
          • interpretation may be difficult: different trees can give the same outcome
          Overfitting is also a problem for decision trees. We can use stopping rules (e.g. limit the maximum depth, limit the minimum number of records per node) or pruning (heuristics or user intervention).

      3. Clustering
      4. Discover the groups. Unsupervised learning. Find groups that are very different from each other but whose members are all very similar.
        1. K-means
          • Data types:
            1. continuous data
            2. categorical data
              1. ordinal - ordered, BUT distance between points not guaranteed e.g. satisfaction rating
              2. nominal - e.g. poisonous/nonpoisonous, post-codes, car make, colours (sometimes continuous)
          • e.g. K-3 algorithm:
            1. pick 3 random points as centroids
            2. assign each remaining point to its closest centroid - results in 3 clusters
            3. recalculate best centroid for each cluster
            4. repeat 2-3 till centroids don't move
            Sensitive to choice of initial centroids so: pick first centroid at random, second to be furtherest in dataset from this point, and third furtherest from these two points.
        2. Agglomerative clustering
        3. Good for getting an idea of how many clusters we should have. Basic algorithm:
          1. each data point begins in its own group: a singleton
          2. compare all pairs of groups and mark the group that is closest
            • if distance between these two groups is < some threshold then merge the two groups
            • else done
          If the threshold value is too low then may have too many groups at the end, many singletons. But is threshold is too large will get clusters containing dissimilar items.
        4. K-medoids
        5. The problem with K-means and agglomerative clustering is that the centroids are not necessarily meaningful for categorical data. So can use, e.g. PAM (Partitioning Around Medoids). Here we recalculate the medoid by swapping, BUT not very efficient.
        6. Dendrogram
        7. For display of clustering results:
          An example dendrogram

    4. Link Analysis
      1. Association discovery
      2. AB
        antecedentconsequent
        e.g. shopping basket: buy alcoholbuy burger
        e.g. using the shopping basket: {(a,b,c),(a,c,d,f),(a,c),(b),(b,c)}
        support/prevalence
        supp(A) = Pr(A)
        = number of times A appears in table [/ num rows in table]
        e.g.supp(a) = 3/5 = 0.6
        e.g.supp(a→b) = 1/5 = 0.2
        confidence
        Confidence(A→B) = supp(AB)/supp(A)
        e.g.Confidence(a→b) = 1/3
        e.g.Confidence(a→c) = 3/3 = 1
        Low support thresholds leads to a rule explosion, BUT lowe support plus high confidence is important sometimes, e.g. diagnosis of rare diseases.
        lift
        Lift(A→B) = Confidence(A→B)/supp(B)
        e.g.Lift(a→c) = 1/(4/5) = 5/4
        Once have a good rule, lift is useful to calculate to determine if the antecedent is a sufficient indicator of the consequent. If confidence is high (so the rule is good) but lift is much bigger than the confidence then to predict the consequent we would need more such rules. e.g. Confidence(a→c) is high(1) but Lift(a→c) = 5/4 > Confidence(a→c)

      3. Sequence discovery
      4. Association related over time.

    5. Decision-Tree Classification using SPRINT
      1. Overview
      2. Attributes:
        • categorical
        • continuous
        • classifying attribute = class
        Each non-leaf node is a split point.
        1. growth phase - till each partition pure or small enough
        2. prune phase - <1% of computation time

      3. Partitioning
        1. The partition algorithm
        2.     Partition(Data data)
                  if data is pure (i.e. all records are in the same class)
                      return
                  for each attribute A
                      evaluate the best split point on attribute A
                  using the best split point, split data into datasets 1 and 2
                  Partition(dataset1)
                  Partition(dataset2)
          
        3. Continuous data
          • All attrubute lists are sorted.
          • To find the best split point on an attribute, start at the beginning of the attribute list and use two arrays (done and notDone) which represent the counts of the classes before and after the current cursor position.
        4. Categorical data
        5. Instead of entire attribute lists to scroll through we can just calculate a single count matrix for each category type. Here the rows are the categories for the category type (e.g. for car type, the rows may be: station_wagon, hatch_back, sports, family etc) and the columns are for the classes. We then recursively determine which category value should be split on.

        6. Calculation of best split point
        7. For two attrubutes we use the values for one of the classes in these arrays to calculate the best split point (that with the lowest ginisplit) for spitting the dataset d into datasets d1 and d2:

          calculation of
		gini split

          where n1 and n2 are counts of the number of items in this class in datasets d1 and d2 respectively, and

          calculation of the
		gini index

          where p is the relative frequency of the chosen class j in dataset d.

        8. Performing the split
        9. Partition the attribute list of the winning attribute, using a list of the rid's that will go on one side (either always the LHS, or calculate the side that will hold the least number of records).

      4. Advantages of SPRINT over SLIQ
      5. SLIQ-R - replicated class list. SLIQ-D - distributed class list - performed worst. SPRINT uses less memory than SLIQ.

        • removes most memory restrictions
        • fast
        • scalable (c.f. SLIQ thrashes above a certain number of records
        • paralletlizable

      6. Disadvantages
      7. Is a greedy algorithm:

        • it minimises diversity at each split, but locally only...not look-ahead
        • tends to stop at local minima - may consider using NN's and simulated annealing

      8. Pruning
      9. Want to prune if data overfitted or the tree is just too big. Use a parameter t. If t is large then get a smaller tree; if t is small then get a larger tree. Either dial t up and down, or calculate t when we chose to cut at particular notes, then test each pruned tree on the test data and determine which is the best t value. i.e. trial and error, and can lead to rules of thumb.

        MDL (minimum description length) = size(tree) + size(misclassifications(tree)).
        Only grow if description length is reduced - so can be used during splitting.

      10. Parallelizing the Classification
      11. Each attribute list split over N processors.

        • continuous data is sorted first then each processor gets a contiguous section and so the initial countdone depends on previous processors, and the initial countnotDone is calculated from this and subsequent processors.
        • categorical data: local count matrix → global count matrix, and coordinator processor evaluates the ginisplit

        The hashtable/list of rids for splits needs to be shared/communicated between all processors.

    6. Association Mining
      1. Association Rules
      2. Let I = a set of items.

        Then A → B where A I and B I and A B = {}

        • holds with confidence c if c% of transactions (rows) in the database that contain A also contain B.
        • has support s if s% of transactions (rows) in the database contain A B

      3. Subsets
      4. For a set with n items, there are 2n subsets.
        e.g. {1,2,3,4,5,6}, possible subsets = [{},{1},...,{6}, {1,2},...,{1,6},{2,3},...,{2,6},...]
        123456
        {}000000
        {1}100000
        {2}010000
        etc.

      5. Apriori algorithm
        1. Algorithm
        2. L1 = {large (count >= min_support) 1-itemsets}
          for (k = 1; Lk != {}; k++)
          • Ck+1= join Lk with Lk (each itemset with first k-1 items, the prefix, in common and last item not in common) then prune out ck+1 where any subset of this made by removing one item is not in Lk
            for each transaction in the database
            add to count of c Ck+1 if the transaction contains this subset c
            Lk+1= Ck+1 only keeping those whose count is > min_support
          Really expensive as everything needs to calculate support for items in the candidate set, Ck, must scan the entire database.
        3. Example
        4. e.g. transaction records(as <tid, {items}>) = {<100, {1,2,3,4}>, <200, {1,3,4,5,6}>, <300, {2,3,4}>, <400, {3,5,6,7}>, <500, {1,3,4,6}>}, and let min_support = 3 and min_confidence = 50%
          1-itemsets 2-itemsets 3-itemsets
          itemsupport itemssupport itemssupport
          {1}3 {1,2}1 {1,3,4}3
          {2}2 {1,3}3 {1,3,6}2
          {3}5 {1,4}3 {1,4,6}2
          {4}4 {1,5}1 {2,3,4}2
          {5}2 {1,6}2 {3,4,5}
          {6}3 {2,3}2 {3,4,6}2
          {7}1 {2,4}2 {3,5,6}2
          {2,5}0
          {2,6}0
          {3,4}4
          {3,5}2
          {3,6}3
          {4,5}1
          {4,6}2
          {5,6}2

          Thus, L1 = {{1}, {2}, {3}, {4}, {5}, {6}} since set {7} does have enough support (must be >= 2). L2 = {{1,3}, {1,4}, {1,6}, {2,3}, {2,4}, {3,4}, {3,5}, {3,6}, {4,6}, {5,6}} since only these subsets have enough support in the transaction data. Subset {3,4,5} was eliminated from C3 because removing item 3 would have resulted in subset {4,5} which is not in L2. Thus, L3 = {{1,3,4}, {1,3,6}, {1,4,6}, {2,3,4}, {3,4,6}, {3,5,6}}. Only the first two subsets of these can be combined (since they both have the prefix {1,3} in common) to give set {1,3,4,6}. All of the subsets of this set produced by removing a single item (1 or 3 or 4 or 6) results in a subset that is present in L3. Furthermore, this set also has a support of 2 and so L4 = {{1,3,4,6}}. Thus out of a possible 27=128 subsets, we only generated 30, with 23 having minimum support.

        5. Generation of rules
        6. Taking the only large 4-itemset we can partition the items in the set between the antecedent and consequent of a rule and calculate suport. In practice we only generate those rules where the consequent has one item (and if this item was the class then we are building a classifier).

          e.g. supp({1,3,4,6}) = 2; confidence=supp(AB)/supp(A)
          A → Bsupp(A)confidence
          3,4,6→12100%
          1,4,6→32100%
          1,3,6→42100%
          1,3,4→6367%

      6. AprioriTid algorithm
      7. The database is not used to calculate support after the first pass. Rather we store the required information in a set Cbark in the form <tid, {Xk}>, where Xk is the set of potentially large k-itemsets that are present in the transaction. Ck+1 is still computed from Lk, but Cbark+1 is computed from Cbark and Ck+1. The advantages of AprioriTid only eventuate when Cbark can fit into memory but the entire database cannot.

      8. Apriori-hybrid
      9. Start by using the Apriori algorithm until Cbark could fit into memory, then switch to AprioriTid. However, there is a cost associated with this switch and so if we are just at the end of the search then this can be a waste.

      10. Rare items
      11. Very low support but very high confidence. Not supported by these algorithms. May be useful for e.g. diagnosis of rare diseases; expensive items in a store (could give these a high weight and use a weighted count to determine support).

      12. Lift
      13. Lift(A→B) = Pr(AB)/(Pr(A)Pr(B)). Density of attribute/class relative to its density in the entire database.

        e.g. 1,3,4→6. Pr(6) in db = 3/6; Pr(1,3,4) in db = 3/6; Pr(1,3,4,6) = 2/6; Lift = 2/6 / (3/6 * 3/6) = 12/9 = 4/3. Not very interesting then

    return to top


  8. Data Warehousing
    1. Impetus for data warehousing
    2. Decision support system
      tool(s) which allows business end users to perform computer generated data analysis on their own; includes reporting, OLAP,... ; aim for fast and accurate retrieval
      OLTP
      on-line transaction processing; needs fast throughput

      If want to test a hypothesis against the database, don't want to interrupt the OLTP as this needs fast throughput, AND if want OLAP to be fast so transaction data would be too slow to go through, so summarize first

      data warehouse
      a copy of transaction data specifically structured for querying and reporting

    3. Generation of warehouse summaries
    4. passive/lazy
      only gets data after a query requiring it has arrived
      active/eager
      gets data as it is produced, or periodically, and before queries for it arrive; the approach used by data warehouses

    5. Architecture
    6. Architecture of the Stanford data warehousing project (WHIPS)

      WHIPS
        architecture

      compilation component - user-defined content of the data warehouse (monitors, integrator) - want to automate this too

    7. Monitors
    8. How to tell when new data/change in information source:

      • triggers in information source notifies monitor
      • every application that changes the information source also notifies the monitor (not such a great idea as may be a lot of these apps)
      • use periodical source data dumps into a file - already have this facility for backing up, but get the snapshot differential problem

    9. Snapshot differential problem
    10. Two snapshots of the database as flat files: F1 at day 1, and F2 at day 2:

      • rows = <K, B> (where K = key)
      • compare files to get a stream of change notifications
        • updated (Ki, B1, B2)
        • deleted (Ki, B1)
        • inserted (Ki, B2)
      • problem: rows with the same key may appear in different places in F1 and F2
        • not always that bad, e.g. sliding window algorithm finds K1 in F1 and looks in first N rows in F2, doesn't find the key so creates a deleted notification. Later finds and creates an inserted notification. End result: K1 is still in the DW so no inconsistencies.
        • if B very large then hash to small signature - may miss a few updates but probability is low
        • problem is continuous so when comparing F1 and F2 to make the change notifications, also create an auxiliary file for F2 to make the next comparison easier
        • sort-merge join, hash join, UNIX diff

    11. Integrator
      • rule-based
      • each rule handles one type of change notification
      • may need additional data from DW, especially if DW is of average or summary data (e.g. average salary and have updated the salary of an employee)
      • distributed object system: WH, integrator, monitors, information sources can all be on different machines and programmed in different languages

    12. Warehouse update anomolies
    13. In a normal database we have ACID transactions where tables are locked preventing update anomolies. However in a DW the data is remote and we don't always have the luxury of table locking (sometimes we might if all the datasources are in the same timezone and we can do it in the wee hours of the morning when they're not being used).

      If integrator issues a query back to an information source (since may need more information for an update) but the query is issued after a modification in the source so source may have changed since then. Solution? recompute the view?

      table T1 table T2 Data warehouse
      wx
      12
      xy
      24
      w(T1 T2) = [1]
      i.e. project w from the natural join (on x, to give <x,y,w>) of the two tables
      update(2,3)
      xy
      23
      update notification: u(T2[2,3])
      now WH sends query to T1:
          Query(T1 [2,3])
      insert(4,2)
      wx
      12
      42
      T1 returns [1][4] to query.
      also, update notification: i(T1[4,2])
      so WH sends query to T2:
          Query([4,2] T2)
      T2 returns [4] to query.
      So, in DW we now have [1][4][4] where we should just have [1][4]

      ECA (Eager Compensating Algorithm) - queue of queries at unanswered end; only works when all tables in one datasource

    14. Data reconciliation
    15. Integrator must also reconcile data from different schemas. e.g. database 1: sex:0|1, database 2: sex:M|F, ...

    16. Advantages
      • DW queries don't hold up the daily transactions on the businesses online/live databases (OLTP)
      • can construct the DW to speed up the querying/reporting process
      • can design the DW to simplify the querying/reporting process
      • can chose to store only cleaned up transactional data without having to fix the OLTP systems
      • querying data from multiple, disparate sources
      • DW can hold older (archival, historic) data too that may have been purged from the online database for speed
      • security: if DW holds only summary information

    17. Disadvantages
      • extra storage space since transaction data is copied, BUT transaction data can be summarized first
      • not so good if absolutely current data is required
      • need to define data sources in advance so not so good if client has unpredictable needs
      • construction and maintenance

    return to top


  9. Multi-dimensional Indexing
    1. Introduction
      1. One-dimensional indexing: B-trees
      2. Designed with minimising disk accesses in mind: ms verses ns to read from disk than to read from RAM so want to minimise the number of disk hits. B+-trees - actual data is at the leaf nodes only. B and B+-trees - all leaves at same level; only ever grow at the root (and shrink at the root furing delteions): an overfull node is split and this change is propagated back up through the tree; sometimes on way down during an insert, if come across a full node may split in anticipation

      3. R-Trees
      4. Previously, indexing on multi-dimensional data just used a different index for each attribute, then had to do joins. R-Trees are B+-trees extended to hold a n-dimensional objects. They allow indexing by multi-dimentional spatial intervals.

    2. R-Trees
    3. m = minimum number of entries in a node; M = maximum number of entries in a node; m <=M/2

      1. Search
      2. Search(S) - for all items whose indices overlap S (can also alter to do "are contained by S" or "contain S"). Most operations on a database are searches - this is the one we really want to be fast.
        Search(S)
            Set T = root node
            Until done
                if T is not a leaf
        	    search subtrees whose indices overlap S
                      (i.e. for each that does, set T = new child node)
                else if T is a leaf
        	    check each index 
        	        if overlap, add to list of qualifying records
        	    done
        

      3. Insert
      4. Insert(E)
            ChooseLeaf -> L
            if num entries in L < M
                install E
            else
                SplitNode(L+E) -> L, LL
            AdjustTree(L) or AdjustTree(L,LL)
            if root split
                put as children of new root
        
        ChooseLeaf
            N = root node
            repeat
                if N is leaf, return N
                else
        	   choose F in N such that index, FI, would require least
        	     enlargement to accomodate E (if tie, choose F with the 
        	     smallest bounding rectangle, FI)
        	   N = FP
        
        
        SplitNode
        split in such a way as to minimise sum of two covering rectangles after split; SplitNode algorithms: Exhaustive, Quadratic and Linear; see Splitting below
        AdjustTree(L[,LL])
            N = L
           (NN = LL)
            repeat til done
                if N is root, done
                else
        	    P = parentOf(N)
        	    En is N's entry in P
        	    adjust EnI to tightly enclose all entry I rects in N
        	    if N has partner NN
        	        create new Enn
        	        set EnnP to point to NN
        		    and EnnI to enclose all rects in NN
        	        if room in P
        		    add Enn to P
        	        else
        		    SplitNode(P+Enn) -> P,PP
        	        N = P
        	       (NN = PP)
        

      5. Splitting
        1. Exhaustive
        2. 2n-1 non-empty subsets, so this many comparisons
        3. Quadratic
          • PickInitPair - choose most wasteful: i.e. pair of items whose bounding rectangle is covering the most space excluding that covered by the two items (takes n(n-1)/2 calculations)
          • PickNext - for each item, calculate increase in Area of adding item to each group and calculate diff = |A1 - A2|, choose item with the biggest differenct (again n2 calculations)
        4. Linear
          • Linear PickPair - along each dimension calculate sepbar = sep/d, where sep = highest low side - lowest high side; d=range; (only 2n comparisons); then take the pair with the greatest sepbar; e.g. below, sep2/range2 is the greatest so would take nodes N1 and N4
            Linear Pick Initial Pair

      6. Delete
      7. Delete(E)
            FindLeaf(E) -> L
            if L is null, return
            delete E from L
            CondenseTree(L)
            if root has only 1 child
                new root = child
        
        FindLeaf(E)
            N = root node
            repeat
                if N is not a leaf
        	    for each F in N
        	        if FI overlaps EI
        		    N = F
                else
        	    for each F in N
        	        if FI matches EI
        		    return N
        
        CondenseTree(L)
            N = L
            Q = {} (list of eliminated nodes)
            repeat
                if N is root
        	    Reinsert(Q)
        	else
        	    P = parent of N
        	    En is N's entry in P
        	    if <m entries in N (**see later)
        	        delete En from P
        		add N to set Q
        	    else
        	        adjust EnI to tightly contain all entries in N
        	    N = P
        
        Reinsert(Q)
            for each node, N, in Q
                if N is a leaf node
        	    Insert(N)
        	else
        	    insert node at correct level
        

        ** Dealing with underfull nodes

        • compare with B-trees - merge with a sibling or orphaned nodes distributed amoung siblings
        • So why choose to reinsert?
          1. does the same thing and we can use the already created Insert method, and the efficiency is comparable as the pages required for reinsertion are usually the same ones used in the search and so are still in memory
          2. indrementally refines structure preventing gradual deteriorations

    4. R*-Trees
      1. Main Contributions
        1. Forced reinsert of p entries when have overflow - this single change gives the best performance enhancement
        2. Different parameters to be optimised
          1. Area (Glutman's R-Trees)
          2. Margin (Perimeter)
          3. Overlap
          4. Utilisation (Storage)
            • R-trees with larger m (minFill) therefore each node is closer to MaxFill
            • m ≈ 40% M as optimum - they set m early on and played with a-c
            • BUT this conflicts with a-c

      2. Insert
      3. Insert(E)
            ChooseLeaf -> L
            if num entries in L < M
                install E
            else
                if L is the root or ReInsert already performed 
        	    Split 
        	    create new root if necessary
        	else
        	    ReInsert(L+E)
            AdjustTree(L)
            if root split
                put as children of new root
        
        ChooseLeaf
            N = root node
            repeat
                if N is leaf, return N
                else
        	   if the nodes in N point to leaves
        	       (minmise the overlap cost)
        	       choose F in N such that rectangle, FI, needs the least
        	       overlap enlargement to include the new data (if tie, 
        	       choose F with the smallest Area enlargement)
        	   else if the nodes in N point to other nodes
        	       (minimise the Area cost - same as before)
        	       choose F in N such that index, FI, would require least
        	       enlargement to accomodate E (if tie, choose F with the 
        	       smallest bounding rectangle, FI)
        	   N = FP
        

      4. Point Queries
      5. Point Query
        non-deterministic - depends on the order in which the elements were inserted

      6. Area Queries
      7. Point Query

        Therefore as the size of the query increases it should be better to minimise the perimeter total. So, if doing mainly point queries, minimise area, and if doing mainly area/range/spatial queries minimise perimeter.

        We prefer square rectangles because this is the largest amount of area for the smallest perimeter

    5. Packed R-Trees
      1. Preprocessing
      2. Packing algorithms are used when you have alot of initial static data with which to create your R-tree. In this way you can sort all your data and then pack it into the leaves, then build from the bottom-up the rest of the R-tree.
        1. Preprocess data file of N objects into ciel(N/M) groups of M rectangles (the last group may not be full).
        2. Load each group into a page and output the <MBR, page_number> (MBR is the minimum bounding rectangle).
        3. Recursively pack these MBR's. Use the page_numbers for pointers from parents.
        The only difference between the three packing algorithms below is how you order the rectangles for putting into the leaves for the first step.

      3. Hilbert space-filling
        • Uses how far along the Hilbert fractal line each rectangle's centre is.
        • Hilbert fractals of order 1, 2, 3 and 4 are shown below:
          Hilbert 1 Hilbert 2 Hilbert 3 Hilbert 4

      4. Nearest-X algorithm
        • very simple - all rectangles ordered by the x-coordinate of their centre
        • creates long skinny bounding rectangles, therefore perimeter not optimised and so not so efficient for range queries (OK for point queries though)

      5. STR (Sort-Tile-Recursive) algorithm
        • First, determine how many nodes will be needed: num_nodes = ciel(N/M). e.g. if we have 100 rectangles/objects and our max_fill = M = 8, then we will needs a total of ciel(100/8) = 13 nodes (leaves in this case as we are doing the bottom level of the R-tree).
        • Then, since we want to tile these nodes into an x by x matrix, we first calculate ` x = ciel(sqrt(num_nodes)). We start with x vertical strips, so we calculate how many rectangles go into each strip, v_fill = M*x. e.g. x = ciel(sqrt(13))=4, so we split the data up along the x-axis into 4 groups. In each strip there will be v_fill = 4*8 = 32 rectangles.
        • In each vertical strip, beginning at the top and working your way down (i.e. sorting on y-value), put M rectangles into each tile/node. e.g. In our first vertical strip we place 8 rectangles, in the second 8 and so on. However, in the last vertical strip there will be a total of only 4 rectangles, so these all go into the same node.
        • Using the MBR's from the leaf nodes, recursively tile your way up the R-tree till you reach the root node. e.g. We will need a total of ciel(13/8)=2 nodes for the next node up, and the next node above that we will only need 1 node so that is the root node.

        The STR authors used small buffer (LRU: least recently used) sizes for testing so they could mimic large data collections. They found no clear winner between the packing algorithms, although Nearest-X is only good for point queries, and STR is simpler than Hilbert space-filling (HS). Their statistics were flawed as they assumed a normal distribution when calculating how well STR had performed over HS.

    return to top