Data modeling 101

Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts, data models can be used for a variety of purposes, from high-level conceptual models down to physical data models. From the point of view of an object-oriented developer, data modeling is conceptually similar to class modeling: with data modeling you identify entity types, whereas with class modeling you identify classes. Data attributes are assigned to entity types just as you would assign attributes and operations to classes, and there are associations between entities similar to the associations between classes – relationships, inheritance, composition, and aggregation are all applicable concepts in data modeling.

Traditional data modeling is different from class modeling because it focuses solely on data: class models allow you to explore both the behavior and the data aspects of your domain, whereas with a data model you can explore only data issues. Because of this focus, data modelers tend to be much better at getting the data “right” than object modelers are. That said, some people will model database methods (stored procedures, stored functions, and triggers) when they are doing physical data modeling. It depends on the situation of course, but I personally think this is a good idea, and I promote the concept in my UML data modeling profile (more on this later).

Although the focus of this article is data modeling, there are often alternatives to data-oriented artifacts (never forget Agile Modeling’s Multiple Models principle). For example, when it comes to conceptual modeling, ORM diagrams aren’t your only option – in addition to LDMs, it is quite common for people to create UML class diagrams and even Class Responsibility Collaborator (CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams because it is very easy to get project stakeholders actively involved in creating the model: instead of a traditional, analyst-led drawing session, you can facilitate stakeholders through the creation of CRC cards.

How are Data Models Used in Practice?

Although methodology issues are covered later, to understand data models better we first need to discuss how they are used in practice. You are likely to see three basic styles of data model:

  • Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams, high-level conceptual models are often created as part of your initial requirements envisioning efforts, where they are used to explore the high-level static business structures and concepts. On traditional teams, conceptual data models are often created as precursors to LDMs or as alternatives to them.

  • Logical data models (LDMs). LDMs are used to explore the domain concepts of your problem domain, and the relationships among them. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types (typically referred to simply as entity types), the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).

  • Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects, and as a result the focus of this article is on physical modeling.

Although LDMs and PDMs sound very similar, and in fact they are, the level of detail that they model can be significantly different. This is because the goals for each diagram are different – you use an LDM to explore domain concepts with your stakeholders and a PDM to define your database design. Figure 1 presents a simple LDM and Figure 2 a simple PDM, both modeling the concept of customers and addresses as well as the relationship between them. Both diagrams apply the Barker notation, summarized below. Notice how the PDM shows greater detail, including an associative table required to implement the association as well as the keys needed to maintain the relationships. More on these concepts later. PDMs should also reflect your organization’s database naming standards; in this case an abbreviation of the entity name is appended to each column name and an abbreviation for “Number” was consistently introduced. A PDM should also indicate the data types for the columns, such as integer and char(5). Although Figure 2 does not show them, lookup tables (also called reference tables or description tables) for how the address is used, as well as for states and countries, are implied by the attributes ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE.
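
To make the PDM concepts concrete, here is a minimal sketch of the kind of schema Figure 2 describes, expressed as DDL run through Python’s sqlite3 module. Since the figure itself isn’t reproduced here, the table and column names (Customer, Address, CustomerAddress, AddressUsage, and so on) are assumptions based on the description above, not the actual diagram.

```python
import sqlite3

# In-memory database for illustration; a real PDM would target your
# vendor's SQL dialect rather than SQLite's loose type affinity.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Entity tables. Per the naming standard described above, each column
-- name carries an abbreviation of its entity, and "Number" is
-- consistently abbreviated (here as NBR).
CREATE TABLE Customer (
    CUST_ID        INTEGER PRIMARY KEY,
    CUST_NAME      TEXT NOT NULL,
    CUST_PHONE_NBR TEXT
);

CREATE TABLE Address (
    ADDR_ID        INTEGER PRIMARY KEY,
    ADDR_STREET    TEXT,
    ADDR_CITY      TEXT,
    STATE_CODE     CHAR(2),   -- implies a State lookup table
    COUNTRY_CODE   CHAR(3),   -- implies a Country lookup table
    ADDR_POST_CODE CHAR(5)
);

-- Associative table implementing the many-to-many relationship,
-- carrying the keys needed to maintain it.
CREATE TABLE CustomerAddress (
    CUST_ID         INTEGER REFERENCES Customer(CUST_ID),
    ADDR_ID         INTEGER REFERENCES Address(ADDR_ID),
    ADDR_USAGE_CODE CHAR(1),  -- implies an AddressUsage lookup table
    PRIMARY KEY (CUST_ID, ADDR_ID, ADDR_USAGE_CODE)
);

-- One of the implied lookup (reference/description) tables.
CREATE TABLE AddressUsage (
    ADDR_USAGE_CODE CHAR(1) PRIMARY KEY,
    ADDR_USAGE_DESC TEXT     -- e.g. 'S' = shipping, 'B' = billing
);
""")
```

Note how the associative table and the lookup tables, invisible at the LDM level, become explicit at the PDM level.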

Common Data Modeling Notations

  • IE. The IE notation (Finkelstein 1989) is simple and easy to read, and is well suited for high-level logical and enterprise data modeling. The only drawback of this notation, arguably an advantage, is that it does not support the identification of attributes of an entity; the assumption is that the attributes will be modeled on another diagram or simply described in the supporting documentation.

  • Barker. The Barker notation is one of the more popular ones: it is supported by Oracle’s toolset and is well suited for all types of data models. Its approach to subtyping can become clunky with hierarchies that go several levels deep.

  • IDEF1X. This notation is overly complex. It was originally intended for physical modeling but has been misapplied to logical modeling as well. Although popular within some U.S. government agencies, particularly the Department of Defense (DoD), this notation has been all but abandoned by everyone else. Avoid it if you can.

  • UML. This is not an official data modeling notation (yet). Although several suggestions for a data modeling profile for the UML exist, none are complete and, more importantly, none are “official” UML yet. However, the Object Management Group (OMG) announced an RFP for data-oriented models in December 2005.

Designing a Database to Be “Performance Friendly” to Reports

What can you do to design a database to be “performance friendly” to reports? In an ideal world you would have a perfectly normalized database, but it isn’t an ideal world. To support reports you often need to consider the following changes to your database design:

  • Take advantage of database features. Each database implements joins, indices, SQL SELECT statement execution, and access paths in slightly different ways. All of these things influence the performance of your queries and thus your reports. This sort of basic knowledge is taught in database certification courses, and I highly recommend that all Agile DBAs become certified on the database technologies they work with. The disadvantage is that this approach helps to lock you into your database vendor, because you come to rely on the unique features that it provides.

  • Introduce aggregate tables. An aggregate table stores denormalized copies of data. For example, a CustomerOrders aggregate table would store summary totals of the orders for customers: one row per customer recording the number of orders placed, the number of shipments made, the grand total of the orders for that customer, and so on. The primary disadvantage is that you need to maintain the aggregate table (see the first sketch after this list).

  • Remove unnecessary data. The smaller the amount of data to process, the faster your queries will run. By removing unnecessary data, either by archiving it or simply deleting it, you can improve the performance of your reports. The disadvantages are that you need to develop an archiving/deletion strategy, and in the case of archiving you also need a data recovery process to support retrieval of archived data.

  • Caching. Caches, whether of data or of objects, can dramatically improve your system performance by replacing relatively slow disk accesses with memory accesses (see the second sketch after this list). The disadvantages are the increased complexity and the increased chance of cross-schema referential integrity problems that caches introduce.

  • Partition a table. The goal is to take a large table, which results in poor performance, and reorganize it into several smaller tables (see the third sketch after this list). Tables can be partitioned vertically by storing different columns in each table, horizontally by storing collections of rows in different tables, or both. The primary disadvantages are that partitioning will complicate your mapping efforts (Chapter 11) and queries can become more complex because you need to work with several tables to support a single concept.

  • Disallow real-time reports. Many organizations choose to support only batch reporting against databases: this ensures that report queries do not interfere with operational applications, keeps performance levels consistent within data marts and data warehouses, and reserves update windows for those databases.

  • Introduce indices. If a report needs to obtain data in a different order from that in which it is stored, a common way to support this is to introduce an index that accesses the data in the required order (the first sketch after this list includes one). The disadvantage is that the additional index slows down inserts and updates, because it must be maintained as well.
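
First, a minimal sketch of the aggregate-table idea (and of introducing an index), using Python’s sqlite3 module. The Orders and CustomerOrders tables and their columns are hypothetical names chosen to match the example in the list above, not an actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (
    ORDER_ID INTEGER PRIMARY KEY,
    CUST_ID  INTEGER NOT NULL,
    TOTAL    REAL NOT NULL
);

-- The index speeds per-customer reads at the cost of slower inserts
-- and updates, the trade-off noted above.
CREATE INDEX ORDERS_CUST_IDX ON Orders (CUST_ID);

-- Denormalized aggregate table: one row per customer.
CREATE TABLE CustomerOrders (
    CUST_ID     INTEGER PRIMARY KEY,
    ORDER_COUNT INTEGER NOT NULL,
    GRAND_TOTAL REAL NOT NULL
);
""")

def refresh_customer_orders(conn):
    """Rebuild the aggregate table from the normalized data. This is
    the maintenance cost: the refresh (or an equivalent trigger) is
    extra work you now own."""
    with conn:  # one transaction, committed on success
        conn.execute("DELETE FROM CustomerOrders")
        conn.execute("""
            INSERT INTO CustomerOrders (CUST_ID, ORDER_COUNT, GRAND_TOTAL)
            SELECT CUST_ID, COUNT(*), SUM(TOTAL)
            FROM Orders
            GROUP BY CUST_ID
        """)

conn.executemany("INSERT INTO Orders (CUST_ID, TOTAL) VALUES (?, ?)",
                 [(1, 20.0), (1, 5.0), (2, 42.0)])
refresh_customer_orders(conn)
print(conn.execute("SELECT * FROM CustomerOrders").fetchall())
# [(1, 2, 25.0), (2, 1, 42.0)]
```

A report that needs per-customer totals now reads a single row instead of scanning and grouping the entire order history.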
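
Second, a sketch of the caching idea: a hypothetical read-through cache over the kind of lookup table discussed earlier, built with Python’s functools.lru_cache. The AddressUsage table and its columns are illustrative assumptions.

```python
import functools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AddressUsage (
    ADDR_USAGE_CODE CHAR(1) PRIMARY KEY,
    ADDR_USAGE_DESC TEXT
);
INSERT INTO AddressUsage VALUES ('S', 'shipping'), ('B', 'billing');
""")

@functools.lru_cache(maxsize=256)
def usage_description(code: str) -> str:
    """Read-through cache: the first call per code hits the database;
    repeat calls are served from memory. If the lookup table changes,
    the cache is stale until usage_description.cache_clear() is run,
    which is the consistency risk the list item warns about."""
    row = conn.execute(
        "SELECT ADDR_USAGE_DESC FROM AddressUsage WHERE ADDR_USAGE_CODE = ?",
        (code,),
    ).fetchone()
    return row[0] if row else "unknown"

print(usage_description("S"))  # hits the database
print(usage_description("S"))  # served from the in-memory cache
```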
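
Third, a sketch of partitioning, again with illustrative table names: Customer is split vertically into hot and rarely read columns, and order rows are split horizontally by year, which is why queries spanning partitions grow more complex.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Vertical partitioning: frequently read columns stay in Customer;
-- bulky, rarely read columns move to a 1:1 companion table.
CREATE TABLE Customer (
    CUST_ID   INTEGER PRIMARY KEY,
    CUST_NAME TEXT NOT NULL
);
CREATE TABLE CustomerDetail (
    CUST_ID    INTEGER PRIMARY KEY REFERENCES Customer(CUST_ID),
    CUST_BIO   TEXT,
    CUST_PHOTO BLOB
);

-- Horizontal partitioning: rows split across tables by year.
CREATE TABLE Orders2023 (ORDER_ID INTEGER PRIMARY KEY, CUST_ID INTEGER, TOTAL REAL);
CREATE TABLE Orders2024 (ORDER_ID INTEGER PRIMARY KEY, CUST_ID INTEGER, TOTAL REAL);
""")

# A report spanning partitions must recombine them: the added query
# complexity the list item warns about.
rows = conn.execute("""
    SELECT CUST_ID, SUM(TOTAL)
    FROM (
        SELECT CUST_ID, TOTAL FROM Orders2023
        UNION ALL
        SELECT CUST_ID, TOTAL FROM Orders2024
    ) AS all_orders
    GROUP BY CUST_ID
""").fetchall()
```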

Comparing Database Types

Operational Database

  • Highly normalized
  • Some summary data for online reporting
  • Requirements driven
  • Supports read/write access by applications
  • Specific to one or more applications
  • Operational applications typically work with this type of database

Data Mart

  • Highly denormalized
  • Requirements driven
  • Specific to a single department and/or collection of application reports
  • Contains a snapshot of near-term information, typically less than a year old
  • Supports read-only access by applications
  • Often uses a multi-dimensional database management system
  • Supports very flexible and unpredictable access to data
  • Ad hoc reporting facilities typically work with this type of database

Data Warehouse

  • Highly normalized
  • May contain some summary data
  • Flexible design that stores granular data
  • Supports read-only access by applications
  • Not specific to any application or department
  • Driven by enterprise-level requirements
  • Contains massive amounts of data, typically an order of magnitude more than a data mart or operational database
  • Often includes several (up to ten) years’ worth of data
  • “Standard” reports are developed and run on a regular basis against this type of database

Remote Cognos Development Opportunity

Does your organization require temporary support with Cognos development? If so, the BI Centre can give you short-term or long-term support to help you meet your deliverables.

We can assist with model development, report development, SDK coding, testing, QA, technical documentation and even project management.

Feel free to contact us today to discuss your needs.

email: businessintelligencecentre@gmail.com
