Database Normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.
Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an Automated teller machine), while less normalized tables tend to be used in database applications that do not need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table’s degree of normalization in terms of normal forms of successively higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not always the case.
Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.
A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies involving data operations. In such a table:
- The same information can be expressed on multiple records; therefore updates to the table may result in logical inconsistencies. For example, each record in an “Employees’ Skills” table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfully—if, that is, the employee’s address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee’s address is. This phenomenon is known as an update anomaly.
- There are circumstances in which certain facts cannot be recorded at all. For example, each record in a “Faculty and Their Courses” table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly-hired faculty member who has not yet been assigned to teach any courses. This phenomenon (fact/ event) is known as an insertion anomaly.
- There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The “Faculty and Their Courses” table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears. This phenomenon is known as a deletion anomaly.
Ideally, a relational database table should be designed in such a way as to exclude the possibility of update, insertion, and deletion anomalies. The normal forms of relational database theory provide guidelines for deciding whether a particular design will be vulnerable to such anomalies. It is possible to correct an unnormalized design so as to make it adhere to the demands of the normal forms: this is called normalization.
Normalization typically involves decomposing an unnormalized table into two or more tables that, were they to be combined (joined), would convey exactly the same information as the original table.
A table-like structure of rows and columns, where the rows store data about an entity and the columns store data about the attributes of that entity. Each column has a unique name, and all values in a column are for the same attribute. The cells in the table can only hold a single value. The order of the rows doesn’t matter, and neither does the order of the columns. The data in each row as a whole must be unique.
A relationship between attributes in a relation where the value (or values) of one (or more) attributes determines the value (or values) of another attribute (or set of attributes). For example, the functional dependency A B expresses the fact that if we know a value of A we will always know the corresponding value of B.
The attribute or set of attributes that functionally determine the value of another attribute or set of attributes. The “left-hand side” of a functional dependency. For example, in the functional dependency A B, A is the determinant.
An attribute or set of attributes that uniquely determines the values of all the other attributes for a row in a relation.
A key that contains two or more attributes.
The candidate key selected to be the “official” key of a relationship.
A column of artificial data added to a relation to serve as the primary key.
The attribute (or set of attributes) that is a primary key in one table that is then placed in a second table to form a relationship to the first table by storing linking values. The term foreign key refers to the attribute or set of attributes in the second table.
|Referential integrity constraint|
A value constraint of a foreign key that states that no value can be placed in the foreign key unless it already exists as a primary key value in the linked table.
One category in a set of categories (“normal forms”) used to describe relations according to the type of anomalies that can occur in the relations.
An anomaly that occurs in a relation when a determinant is matched to a set of values rather than a single value. For example, the multivalued dependency Person Sibling means that a person may have more than one sibling. This is not a problem if the multivalued dependency exists in a separate table, but can cause a problem if the multivalued dependency exists within a table with other attributes.
For a table to be a relation:
- The rows must contain data about an entity
- The columns must contain data about the attributes of the entity instances
- All entries in a column must be the same data type
- Each column must have a unique name
- Each cell in the table must hold only a single value
- The order of the columns must be unimportant
- The order of the rows must be unimportant
- The data in each row must be unique – no two rows may contain identical data