SSAS Tabular Modeling

One Giant Model vs. Splitting Up into Multiple Models

One of my clients requested that I build a data warehouse in SSAS Tabular contained in a single model. I felt that might not be a good idea, but I needed supporting arguments. Fortunately, I came across this topic on social media: a question submitted to @GuyInACube, who asked his guest, Marco Russo,

“Is it OK to build one giant data model, or should you split it up into multiple models?”

@GuyInACube

Marco Russo answered quickly, without a pause, “No” [you should not make one big model].

His reasoning against one giant model is non-technical. Here is an example: the same entity, or even the same measure, might have different names or different meanings in different departments of the same company. They might even need to be defined with different code!

That problem is exactly what I experienced when trying to develop within one large model. The DW contained HR data, sales and commissions data, revenue, service, and utilization facts.

Right away, I noticed a difference in the definitions of sales type and revenue type. The rules classifying sales revenue types differed from the rules classifying actual revenue types (in the general ledger), yet the client asked for one model and expected each department, when using it, to see 100% agreement with its existing transactional reports.

I agree that having your analytic data match transactional reports is necessary; if they don’t match, the users won’t have confidence in the data. I solved this problem the only way I could.

I created separate measures:

  • Sales Revenue
  • Company Revenue

I also created separate dimensions:

  • Sales Revenue Type
  • Company Revenue Type

I also created perspectives: Sales, Corporate, Fleet, and HR. This helped reduce the confusion, but a few users preferred to access the entire model, and in my opinion, that caused problems with user-designed reports.

The best approach, however, is to create one giant SQL data warehouse out of which you publish multiple data marts, each data mart having its own SSAS Tabular Model.


Infected by COVID Data

In March 2020, when the USA shut down to combat the coronavirus, my workload dried up (as it did for many people). I used some of my newfound extra time to work with the publicly available COVID data. Here is how that unfolded.

Part 1. Getting Started Was Easy

Thanks to all the people behind the COVID Tracking Project, it was easy for me to get started. I clicked through a few pages on the CTP site, noticed they had an API that returned JSON data, and knocked out a quick SSIS package to import USA COVID data into SQL.

I used a third-party SSIS component from ZappySys, their JSON Parser Transform object. With that, pulling this data into SQL was pretty easy.

The only “snag” was the way they recorded the values each day: as cumulative totals per state. That makes some graphing easier, but it’s not really the way I want to store data in a DW, so I used a SQL windowing function to derive the daily totals and built fact tables containing daily counts instead of cumulative-to-date values.
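
Here is a minimal sketch of that step, assuming a hypothetical staging table (stg_CTP_Daily) holding one row per state per report date with cumulative totals; all names here are illustrative, not the actual objects from my project:

-- Derive daily new counts from cumulative state totals
SELECT
    ReportDate,
    StateCode,
    CumulativeCases  - LAG(CumulativeCases, 1, 0)
        OVER (PARTITION BY StateCode ORDER BY ReportDate) AS NewCases,
    CumulativeDeaths - LAG(CumulativeDeaths, 1, 0)
        OVER (PARTITION BY StateCode ORDER BY ReportDate) AS NewDeaths
FROM stg_CTP_Daily;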

Part 2. Keeping Up With the Joneses

After I pulled in the CTP data, I ran some quick reports in Excel just to satisfy my curiosity. At that point, the data reported in the United States was really limited to overall cases and deaths in the USA and in New York. I got to thinking I could build something quickly in Power BI and share it, so I looked at the Power BI Community pages.

When I perused the gallery, I came to a few conclusions:

  1. There weren’t many dashboards (27 total).
  2. They didn’t have useful measures, just totals.
  3. I needed world-wide data.
  4. I also needed county-level granularity (for the USA).

So instead of creating visualizations with the CTP data, I decided to use another dataset, the Johns Hopkins data, which covers US counties as well as worldwide totals.

Part 3. Keeping Up With Johns Hopkins

Dear John,

Thank you for all your hard work, but…

Yes, it is borderline ungrateful to joke about writing a “Dear John” letter to “break up” with Johns Hopkins. After all, they’ve performed a valuable public service. However, the data they provide has had some quality issues.

This image of SSIS packages shows that I wrote four different packages to import JHU data because they made several changes to the file format, and when they made a change, they did not re-export all the old data in the new format. For maximum flexibility, my solution can re-load all of the data from scratch into a blank database. The only way to achieve that is to maintain a package dedicated to handling each format (as I did).

Perhaps that wouldn’t have been a huge inconvenience, except that I had written a package to log the flat files as they became available, and my logging package now needed to hand off to the correct import component.

Visual Studio includes a File System task, but I opted to use (again) a component from ZappySys, their Advanced File System Task which is much richer than the standard one in Visual Studio.

Logging the flat files is really overkill in a project like this, but it gives me a chance to show an example of how to do this sort of thing. I went down this road before I discovered that JHU was changing their file format over time, so I then needed another table to record which format each file used. Here is what that looks like:
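
As a rough sketch of those logging tables, here is one way they could be shaped; every table and column name below is hypothetical, not taken from my actual solution:

-- One row per known JHU file layout
CREATE TABLE dbo.FileFormat (
    FormatId      int          NOT NULL PRIMARY KEY,
    FormatName    varchar(100) NOT NULL,   -- a label for the column layout
    FirstFileDate date         NOT NULL,   -- first daily file that used this layout
    LastFileDate  date         NULL        -- NULL while the layout is still current
);

-- One row per daily flat file discovered by the logging package
CREATE TABLE dbo.FileLog (
    FileLogId int identity(1,1) NOT NULL PRIMARY KEY,
    FileName  varchar(260)      NOT NULL,
    FileDate  date              NOT NULL,  -- the date encoded in the JHU file name
    FormatId  int               NOT NULL REFERENCES dbo.FileFormat (FormatId),
    LoadedAt  datetime2         NULL       -- set when the matching import package processes the file
);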

Those minor gripes about file formats were only foreshadowing of the real problems in the data. JHU was continually changing the names of the locations. Usually each change was an improvement, but the previously provided data was never updated to match (despite the data being published in a Git repository, which can handle versioning). Therefore, I had to create a crosswalk to master the data. Here is a portion of the crosswalk.

The columns in blue, on the left, are the raw values in the JHU data. The column in yellow, to the right, is the mastered value that would be stored in the data warehouse. Keep in mind that without such a process, all of the records for the Bahamas (for example) would not fall into the same bucket.
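
To make that concrete, here is a minimal sketch of applying such a crosswalk during staging; the table and column names are hypothetical:

-- Roll raw JHU rows up to the mastered country name via the crosswalk
SELECT
    xw.MasterCountryName,
    jhu.ReportDate,
    SUM(jhu.Confirmed) AS Confirmed,
    SUM(jhu.Deaths)    AS Deaths
FROM stg_JHU_Daily AS jhu
JOIN dbo.CountryCrosswalk AS xw
    ON xw.RawCountryName = jhu.Country_Region   -- matches whichever spelling JHU used that day
GROUP BY xw.MasterCountryName, jhu.ReportDate;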

I did more than just correct the inconsistencies. I also split outlying territories into countries of their own. My thought was that in the context of pandemic data, you don’t want the UK numbers to contain the Cayman Islands; the same goes for the Netherlands and Aruba. For what it’s worth, I also provided a mechanism to re-join territories to their colonial parent.

Part 4. Enriching The Data

Because I intended to bring all of this into a useful dashboard, I gathered a fair amount of related data.

  • Population — necessary for per capita measures
  • Latitude and Longitude — often included, but sometimes incorrect
  • Continent and Region

When complete, those became attributes in the Country dimension table. It looks like this:

I constructed the same for USA States and USA Counties. Those dimension tables are wider than the country table, so I am showing just the first record from each, listing the columns vertically:

So far I’ve covered countries, states, and counties. Another useful dimension is the “metropolitan area,” often called an MSA, and more accurately labeled a CBSA (core based statistical area). Metropolitan areas are often how we think of our cities in the USA. I could relate the county data to CBSA data, so I built a dimension for USA Metro Areas, too. Here it is:

Notice how the MSA names are really long? That is all I could get for names in the official data, but for reporting I really wanted short names, so I had to look those up online. It felt important to know that Seattle is referred to as “Seattle Metro” while Detroit goes by “Metro Detroit.” For the few that didn’t have a clear standard name, I picked the best name by sampling various sources online.
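
To show how the pieces relate, here is a rough sketch of the county-to-CBSA relationship behind that dimension; all table and column names below are hypothetical stand-ins:

-- Tie each US county to its metro area (CBSA) and the curated short name
SELECT
    m.CBSA_Code,
    m.CBSA_Name,        -- long official name, e.g. 'Seattle-Tacoma-Bellevue, WA'
    m.MetroShortName,   -- curated short name, e.g. 'Seattle Metro'
    c.CountyFIPS,
    c.CountyName,
    c.StateCode
FROM DIM_County as c
JOIN dbo.CountyToCBSA as xw on xw.CountyFIPS = c.CountyFIPS
JOIN DIM_MetroArea as m on m.CBSA_Code = xw.CBSA_Code;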

Part 5. Completing the SQL DW

After adding the dimension for USA Metro Areas, I had a fairly complete set of data that I could design proper measures for. I haven’t really covered this yet, but I designed it for a daily refresh:

  1. Git desktop client would pull refreshed JHU data
  2. SSIS logged every new or changed file
  3. SSIS imported new files from JHU
  4. SSIS pulled new data from CTP
  5. A stored procedure staged the data and built the fact records (a simplified sketch follows below).
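
As a rough illustration of that last step, here is a simplified sketch of one fact build; every object name here is hypothetical, not my actual stored procedure:

-- Add any country/day rows not already present in the fact table
INSERT INTO FACT_Covid_Country (DateKey, CountryKey, NewCases, NewDeaths)
SELECT d.DateKey, c.CountryKey, s.NewCases, s.NewDeaths
FROM stg_Covid_Country_Daily as s
JOIN DIM_Date    as d on d.CalendarDate = s.ReportDate
JOIN DIM_Country as c on c.CountryName  = s.MasterCountryName
WHERE NOT EXISTS (SELECT 1
                  FROM FACT_Covid_Country as f
                  WHERE f.DateKey = d.DateKey
                    AND f.CountryKey = c.CountryKey);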

The solution, thus far, was a fairly proper data warehouse based on the dimensional (Kimball) model, and I could then take it into SSAS Tabular, where I could define measures to be used in reports and dashboards. That’s exactly what I did. In Visual Studio, the Tabular model looks like this:

Quick Rundown: the top table is my calendar table, sometimes called DIM_Date or DIM_Time in a SQL data warehouse. The calendar is a dimension, of course, as are the bottom four tables: Countries, US States, US Counties, and US Metropolitan Areas.

Across the middle are three fact tables: COVID Worldwide, COVID in US States, and COVID in Counties and Metros. I should say that in SSAS Tabular it’s technically improper to keep using the terms “fact” and “dimension”; we don’t need them any more.

Part 6. Remodeling Before It’s Complete

My re-telling of this is a little out of sync with the actual re-modeling of the data. Truth be told, I skipped the part where I originally built a single fact table having multiple grains: each record of COVID data could be a country, a state or province, or a US county. I made it work, but when I started writing measures to calculate per capita data, the coding was unnecessarily complex. The advantage of having a single table was that I could easily compare a county or a state with a country. I also liked the challenge of doing something difficult.

I decided, wisely, to drop it and build separate facts for each level of granularity: county, state, and country. The thing that really stinks about this is that when creating reports, you have to choose among different measures for the same thing: Country Cases for a country and State Cases for a US state. That might not sound like a big deal, but if you create a country visualization and want to copy/paste it as the basis of a state visualization, after pasting you need to delete the measure and replace it.

There is a really good solution for this, though. You can create a single measure that calls whichever underlying measure is needed. I did that, and I added empty tables to hold the new measures, now grouping them by what they are (cases or deaths) instead of what they belong to (countries or states).

Visual Studio is not always the best tool for these kinds of edits to SSAS Tabular projects, but the application Tabular Editor works nicely. Here are the changes I just described, shown in Tabular Editor:

Tabular Editor showing new tables for measures in green. The old tables, highlighted yellow, are hidden from report authors. The new measures just figure out which old measure to call.

Part 7. Darkness at the End of the Tunnel

When I had the first inclination to work on the COVID data, I went from idea to visualization in about an hour. After that first Excel chart, completed on April 21, 2020, I dedicated myself to building a proper data warehouse and created no visualizations until I was done with it two weeks later.

I got started on a worldwide dashboard first. Instead of just giving case counts and deaths by country, I decided to create new measures such as Count of Countries with Cases Rising (week over week), and Count of Countries with Cases Falling. Being immersed in this data caused a real malaise.

It’s hard to communicate what it was like except to point out that I was using all the same tools and practicing the same methodology that I’ve used for my client work for more than ten years. Day in and day out I worked on the COVID data like it was my job. From every angle this project was just like any other, except for some glaring details:

  • First and foremost, I see clearly in the data that the USA has done poorly, and my home state is among the worst in the country.
  • The available data is full of undercounts at every level, for various reasons: some innocent, some malevolent, some just due to lag.
  • A significant portion of the population believes the opposite is true: that the numbers are overcounted and the pandemic is a hoax.

I’m not one of those people. I lost a family member to COVID, and I am concerned for the future, not just for our health but for our economic well-being, too. Below is a chart I made with Tableau, ranking the states by cases per capita over the last sixty days, June 27 through August 25, 2020.

Thank you for your attention. If you are so inclined, I would be more than happy to hear from you.


Using Derived Tables in SQL

Common Table Expressions (CTE) and Derived Tables are very useful when you are reporting from transactional tables as opposed to analytical, or OLAP, tables (e.g., SQL Server Analysis Services, or SSAS). The reason is this: while OLAP structures are built for summaries and reporting, transactional tables typically do not summarize records; instead, they contain records of transactions, or of the objects that conducted the transactions (and/or related info). I’ll explain with an example.

Imagine a table of baseball games. Each record contains the team, its opponent, and the result (win or loss). You can count all the games played by each team very simply:

SELECT TeamName, count(*) as GameCount
FROM Game
GROUP by TeamName

You can also count all the wins by each team, just by adding a WHERE clause, like this:

SELECT TeamName, count(*) as GameCount
FROM Game
WHERE Result = 'Win'
GROUP by TeamName

Problem: If you want to produce a report of each team’s won-loss record, it would take two passes at the data — first, get the game count, and then get the win count (and subtract games minus wins to get the losses).

Solution #1: Derived Tables

If you use a derived table, you can produce each team’s won-loss record with one SQL statement. To do this, take one of the statements above and use it almost as if it existed as a table in the SQL database. Here is what it looks like:

SELECT TeamGames.TeamName
   , WinCount as Wins
   , count(*) - WinCount as Losses
FROM Game as TeamGames
JOIN (
    SELECT TeamName, count(*) as WinCount
    FROM Game
    WHERE Result = 'Win'
    GROUP by TeamName
    ) as TeamWins on TeamGames.TeamName = TeamWins.TeamName
GROUP by TeamGames.TeamName, WinCount

See how I joined onto the second statement as if it were a table, within the parentheses? There is no TeamWins summary table, but the second statement is what a TeamWins summary table would look like. Therefore, we join onto it by placing the code from the statement in between the parentheses. Then we name the table and provide the join criteria.

Actually, there is a problem with the third statement. Any team that has not won a game will not show up at all in the results. This is because our join is treated as an INNER JOIN in T-SQL. We need to make a few changes to perfect this, first by using a LEFT OUTER JOIN and then by adding the isNULL function to convert any NULLs into zeroes. Here is the finished code:

SELECT TeamGames.TeamName
   , isNULL(WinCount, 0) as Wins
   , count(*) - isNULL(WinCount, 0) as Losses
FROM Game as TeamGames
LEFT OUTER JOIN (
    SELECT TeamName, count(*) as WinCount
    FROM Game
    WHERE Result = 'Win'
    GROUP by TeamName
    ) as TeamWins on TeamGames.TeamName = TeamWins.TeamName
GROUP by TeamGames.TeamName, WinCount

Solution #2: Common Table Expressions

Derived tables have been used in SQL for a long time, but the newer way to do this is via a CTE. This syntax is part of the ANSI SQL:1999 specification and has been part of T-SQL (Microsoft) since SQL Server 2005. In a CTE, you define the table as an expression at the beginning of the statement, as shown in this code:

WITH CTE_TeamWins as (
                      SELECT TeamName, count(*) as WinCount
                      FROM Game
                      WHERE Result = 'Win'
                      GROUP by TeamName
                      )
SELECT TeamGames.TeamName
   , isNULL(WinCount, 0) as Wins
   , count(*) - isNULL(WinCount, 0) as Losses
FROM Game as TeamGames
LEFT OUTER JOIN CTE_TeamWins as TeamWins on TeamGames.TeamName = TeamWins.TeamName
GROUP by TeamGames.TeamName, WinCount

With the example above, it’s hard to tell why the CTE method is typically preferred over derived tables. The CTE actually uses more code, and — in my opinion — is less readable for this very simple example. However, as things become more complex, the advantage of using CTE instead of derived tables becomes more apparent.

Here are advantages of using a CTE versus a derived table:

  • A CTE can be referenced multiple times in the same statement, but a derived table would need to be typed out for each reference.
  • You can create CTEs that reference other CTEs, keeping your code cleaner and easier to read.
  • You can create recursive CTEs, which are useful for traversing hierarchies (see the sketch below).
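
As an illustration of the recursive case, here is a minimal sketch that walks an employee hierarchy; the Employee table and its columns are hypothetical and not part of the baseball examples above:

-- Hypothetical table: Employee (EmployeeId, ManagerId, EmployeeName), ManagerId is NULL at the top
WITH CTE_OrgChart as (
    -- Anchor member: start at the top of the hierarchy
    SELECT EmployeeId, ManagerId, EmployeeName, 0 as Depth
    FROM Employee
    WHERE ManagerId IS NULL
    UNION ALL
    -- Recursive member: add the direct reports of the rows found so far
    SELECT e.EmployeeId, e.ManagerId, e.EmployeeName, c.Depth + 1
    FROM Employee as e
    JOIN CTE_OrgChart as c on e.ManagerId = c.EmployeeId
)
SELECT EmployeeId, EmployeeName, Depth
FROM CTE_OrgChart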

Word to the Wise – Performance Issues

You can use CTEs for reporting like the examples shown above. However, using CTEs (or derived tables) in large ETL solutions can cause performance issues. In those cases, it’s typically better to use temporary tables in place of CTEs.
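
For illustration, here is a minimal sketch of the temp-table alternative, staging the win counts once so any number of later steps can reuse them; it assumes the same hypothetical Game table as the examples above:

-- Stage the win counts in a temporary table
SELECT TeamName, count(*) as WinCount
INTO #TeamWins
FROM Game
WHERE Result = 'Win'
GROUP by TeamName;

-- Reuse the staged counts (other steps could also join to #TeamWins)
SELECT TeamGames.TeamName
   , isNULL(WinCount, 0) as Wins
   , count(*) - isNULL(WinCount, 0) as Losses
FROM Game as TeamGames
LEFT OUTER JOIN #TeamWins as TeamWins on TeamGames.TeamName = TeamWins.TeamName
GROUP by TeamGames.TeamName, WinCount;

DROP TABLE #TeamWins;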

