Archive for the ‘SSIS’ Category.

SSIS Container Bug

Noticed something a bit scary today. When I placed a Loop container inside a Sequence Container, my Loop package connectors just disappeared….

So pic 1 became pic 2 once i’d checked the project back in to VSS

loopwithline  loopwithoutline

So my question was…. would my loop tasks run in order or just be all over the place. Apparently it is a bug that is not yet fixed but “nothing to worry about” i.e. everything will run as designed, but you just can’t see the arrows!! If you nudge the loop container the connectors do re-appear, although I find they disappear again on VSS check-in…

Script to read string from specific position in flat file

Recently we were setting up ETL routines to read data from AS400 output flat files. That wasn’t in itself a problem however we did need to check the date of the data within the files before we ran the ETLs. If the date was not what was expected, the ETLs should not run.

Trouble is, the dates themselves were on line 4 of each of the files from char 2 to 9 in YYYYMMDD string format…. (see pic below)

textfile.PNG

Question, how to get these as proper dates into SSIS package variables so they could be used for precedence constraints.

Well I am sure there are a few ways we could have done this, but it does present a nice example of how we can use the SSIS Script component to populate variables and read files such as these.

The first thing I did was to set up some package variables as follows

  • CDFile – String
  • DealFile – String
  • FileLocation – String
  • FilesExist – Boolean
  • Load_Date_CD – Datetime
  • Load_Date_Loan – Datetime

To be able to read & write variables in the Script Task, you need to specify them in the script editor

scripted1.PNG

Notice that we read from three of the variables, but can write to the boolean FilesExist variable (which we need to do as we use this as a constraint later on). We then use the following script to check for the file existence.

Imports System
Imports System.Data
Imports System.Math
Imports System.IO
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain
' Created: Will Riley

Public Sub Main()
Dim fileLoc As String, fileName_deal As String, filename_cd As String
If Dts.Variables.Contains("User::FileLocation") = True AndAlso _
Dts.Variables.Contains("User::DealFile") = True AndAlso _
Dts.Variables.Contains("User::CDFile") = True Then
fileLoc = CStr(Dts.Variables("User::FileLocation").Value)
fileName_deal = CStr(Dts.Variables.Item("User::DealFile").Value)
filename_cd = CStr(Dts.Variables.Item("User::CDFile").Value)
System.Windows.Forms.MessageBox.Show("FileDir:" + fileLoc + "FileName:" + fileName_deal)
System.Windows.Forms.MessageBox.Show("FileDir:" + fileLoc + "FileName:" + filename_cd)
If File.Exists(fileLoc + fileName_deal) Then
If File.Exists(fileLoc + filename_cd) Then
Dts.Variables.Item("User::FilesExist").Value = True
System.Windows.Forms.MessageBox.Show("Files exist")
Else
Dts.Variables.Item("User::FilesExist").Value = False
System.Windows.Forms.MessageBox.Show("Files do not exist")
End If
Dts.TaskResult = Dts.Results.Success
Else
Dts.TaskResult = Dts.Results.Failure
End If
End If
End Sub
End Class

 

We can now use a precedence constraint so that the next step is only carried out if the FileExists variable has been set to TRUE

constraint1.PNG

So, now we can extract the dates and wite them to the two variables…. using another Script component.

Public Class ScriptMain

' The execution engine calls this method when the task executes.
' To access the object model, use the Dts object. Connections, variables, events,
' and logging features are available as static members of the Dts class.
' Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
'
' To open Code and Text Editor Help, press F1.
' To open Object Browser, press Ctrl+Alt+J.

Public Sub Main()
'

Dim file_cd As String, fileloc As String, file_loan As String
Dim strReadLine As String, filedate As Date

fileloc = CStr(Dts.Variables("User::FileLocation").Value)
file_cd = CStr(Dts.Variables.Item("User::DealFile").Value)
file_loan = CStr(Dts.Variables.Item("User::CDFile").Value)

'get the CD filedate variable

Dim oStream As New IO.StreamReader(fileloc + file_cd)
'convert the fourth line of the flat file to a date format
'line 4 always starts with capital A char so we can look for it that way
While oStream.Peek <> -1
strReadLine = oStream.ReadLine()
If Left(strReadLine, 1) = "A" Then
filedate = New DateTime( _
CInt(Strings.Mid(strReadLine, 2, 4)), _
CInt(Strings.Mid(strReadLine, 6, 2)), _
CInt(Strings.Mid(strReadLine, 8, 2)))
Exit While 'leave loop here if A found
End If
End While

'write filedate to package variable
Dts.Variables.Item("User::Load_Date_CD").Value = filedate

System.Windows.Forms.MessageBox.Show(CStr(Dts.Variables.Item("User::Load_Date_CD").Value))

'get the loan file date variable

Dim oStream1 As New IO.StreamReader(fileloc + file_loan)
'convert the fourth line of the flat file to a date format
'line 4 always starts with capital A char so we can look for it that way
While oStream1.Peek <> -1
strReadLine = oStream1.ReadLine()
If Left(strReadLine, 1) = "A" Then
filedate = New DateTime( _
CInt(Strings.Mid(strReadLine, 2, 4)), _
CInt(Strings.Mid(strReadLine, 6, 2)), _
CInt(Strings.Mid(strReadLine, 8, 2)))
Exit While 'leave loop here if A found
End If
End While

'write filedate to package variable
Dts.Variables.Item("User::Load_Date_Loan").Value = filedate

System.Windows.Forms.MessageBox.Show(CStr(Dts.Variables.Item("User::Load_Date_Loan").Value))
'
Dts.TaskResult = Dts.Results.Success
End Sub

End Class

We can then use these dates within the variables to perform other checks before running the main ETL jobs. In this way we can ensure that we do not run the same data into the warehouse database twice without the need for any manual checks. We can also ensure using various constraints that the data is loaded without missing out any dates.

Using Excel as a front end to SQL2005 Analysis Services (Part 6)

SQL Server Integrated Services (SSIS)

After designing the database the next job was to write the load program. I have seen a few systems that do this with code which looks very time consuming both from a development point of view but also for the day to day running. Fortunately SQL Server 2005 provides SSIS which is a complete revamp of the old DTS.

You access SSIS using Visual Studio 2005. For me it was the first time to open Visual Studio so it took a while to work out what Visual Studio was all about and then some time longer to figure out what to do with SSIS.

The great thing though is that SSIS is very visual. You just click and drag elements around to build up the process, stopping to define fields and write snippets of SQL. The other nice thing is that it is a very fast and controlled way to load data.

There are certain elements that you need to get to grips with. Here are a few of them.

Project – Like anything in VS this is the top level. You create a SSIS project that you keep everything in. A project contains a number of packages.

Package – A package contains a number of tasks and the order of the tasks and the logic of that order is defined in the package. Initially I only used one package but it started to become unwieldy so I soon started to split the project up into packages like :-
a. FTP Files from Transactional Server
b. Update Master Files
c. Load Transactions

Later I found the need to add utility packages eg. Delete All Records in All Tables

Task – A Task can be many things. Some examples

FTP a file,
Load a text file to a table
Send an email
Run a complete package (containing other tasks)
Run some SQL
Run a executable
Run some VB.NET script

The following is a screen shot of a package to update the analysis tables.

 SSIS Screenshot 

Each box contains an UPDATE or INSERT INTO SQL. The boxes turn yellow when executing and green if successfully complete. I managed to catch this one in the middle of processing.

Carl

Using Excel as a front end to SQL2005 Analysis Services (part 5)

The Database

Whilst probably not very time consuming, designing the database has to be the most important step. My advice would be to take your time and think carefully.

A big decision you have to make at the outset is if you are going to use autonumber IDs or not. Most “expert” db designers will tell you must use them. I have many doubts on the over use of IDs but in this sort of project the problem with autonumber Ids is that they complicate the SSIS project. Let me give you an example to demonstrate.

Suppose you load all the product information into you product table. As you do so you assign IDs for each product. The next day you receive an update file from the transactional system. The transactional system has no idea about your Ids and just passes you the product codes. Maybe somebody has changed a product description and updated the product weight field.

If your primary key is the product code you can load the product file into a temporary table and use a simple UPDATE query to update the product details and another to INSERT new ones. If, however, you have used IDs you first have to find each product in the table, find its ID and use that to update the record. OK SQL 2005 has some built in features to help you do this but if you are a simple Excel developer this is going to be a seriously steep learning curve and it may well significantly slow the loading time. If space is important and you are planning to load a few hundred million records then IDs maybe the best way forward, but if you are in that camp you probably will not be reading this. My guess is any data an Excel developer will get his/her hands is trivial for SQL 2005.

You may have guessed I opted to keep it simple and use the codes from the transactional system as primary keys.

Ok on to the design. Well I can not go through the whole thing on a blog but I can touch the main areas.

Remember we are looking at sales data. The main thing you have in sales data is headers and lines. A header is a sales invoice and a line is a product on the invoice. So the main tables are Invoice Headers and Invoice Lines. I decided to keep all the fact (numbers) data at the invoice line level only, and use the headers as an intermediate table. In OLAP terminology this is called a snowflake design. This is because much of the analysis (eg customer, date ..) is not in the main fact table but links to it through another table using a key (the invoice number). The books suggest using a Star design where possible (ie all tables link directly to the fact table). I could not think how to make this a star without seriously going against normalisation principals so I stuck with the snowflake.

For analysis I created 15 analysis tables – 5 for product analysis, 5 for customer and 5 for transactional analysis. The question here is why not create one single analysis table with a key for analysis type. Well, firstly using many tables keeps the loading simple and two it keeps the cube simple. Is it the best way ? I don’t know, but when you are starting out in a new area simple is good. As the analysis tables are small I though that it was not a big sacrifice to the normalisation ideal.

The next big question was time. I noticed in the MS sample database a time table which basically keeps a record for every day you might possibly use and keeps facts about the day in the record like the day description, the week number, the day of the month …). I know it sounds strange but now I think it is the only way to go. It was a pain creating it. Basically wherever you have a date you link it to the time table.

So in summary. No Ids just natural keys. Two big tables (headers & lines), a time table, some master tables (Products, Customers …) and lots of analysis tables.

Using Excel as a front end to SQL2005 Analysis Services (part 4)

Why use Excel….

In January 2006 I ordered SQL 2005 and set off on my first real “business intelligence/ data warehouse” project. One customer, that I have a good relationship with, wanted to understand their sales data better and was willing to contribute to development costs as long as I came up with something fast. I promised I would have a system by April and ploughed in. February and March were tough months and I worked most weekends and often 14 to 16 hour days to get there but we went live in April.

Well I guess the big question is why Excel and then why SQL Server 2005 ?

The simple answer is, that is what I knew. I have used Excel for many years and had recently learnt VBA at the back end of 2005. Looking at the data of the company I realised I need to handle between 5 and 10 million sales transactions and, after reading around, SQL 2005 seem to fit. It looked like it would be the most compatible with Excel at the front end and give me plenty of scope for further development.

It is few months on now so, would I still use Excel ? I will let you decide but here are some of the pros and cons from my point of view

Pros

Speed of implementation – you can link a few pivot tables and charts to an AS cube in minutes

Experience – you do not need a programmer to write a spreadsheet

All Microsoft – Excel links directly with SQL2005

Direction – Excel 2007 will be even better and more integrated with SQL 2005. MS is putting their money behind Excel as a BI tool

Excel’s pivot tables still appear the most flexible way to display data

Cons

Security – Excel is very weak but you do not need to store any data in the Excel “client”. Using Windows

Authentification – SQL 2005 takes over the security side.

Image – many think Excel is not up to the job.

Direction – I am not sure if ASP.NET would be a better platform

It is still a client even though it is on every desk. I would prefer a browser but find it difficult to beat the Excel Pivot Table.

What do you think ?

Carl

Using Excel as a front end to SQL2005 Analysis Services (part 3)

A Note About Performance

The performance gain from SQL 2005 Analysis Services can change the way users analyse their data. I have seen that when they receive answers in seconds rather than minutes they often use the extra time to dig much deeper into the data.

From a technical point of view, memory is very important for SQL 2005 and OLAP cubes. I have found that we need 2Gb on the machine with the cube to really start to perform. Much will depend on the machines and the network but here are a few indicative numbers.

Config : Dual core 3GHz with 2Gb memory
Average response times using a database with 8 million sales transactions :-

1. Connect to SSAS and create a Pivot Table programmatically – 3 to 4 seconds

2. Select a page item – 0.1 to 0.2 seconds

3. Drag a single dimension to another pivot area – less than 0.5 seconds

4. Drill into a multidimensional hierarchy – less than 1 second

5. Drag one multi dimensional hierarchy into a pivot area containing another multi dimensional one – 2-3 seconds

6. Analyse : To search through all dimensions looking for the contents of the active cell, decide on the type of analysis based on the dimension found, create the four OWC charts and display them in the form – 4 to 5 seconds

7. With the OWC charts you do not use so many dimensions so I have not seen any responses that go over 0.5 seconds.

When you start to drag around three or four multidimensional hierarchies (especially using multi selected items) responses slow and memory starts to peak. Fortunately very few users get that far.

I believe it is much better to use Office 2003 on the client side because of the increased limits on the Pivot Tables and memory handling. I would suggest minimum 512K memory on the client although machines with 256K do work.

Regards,
Carl

Using Excel as a front end to SQL2005 Analysis Services (part 2)

The Look and Feel

I thought I would start with a few screen shots. I can not show too much because the real data is sensitive and I do not have a demo database but I will try to give you the idea. I have blanked out some elements for privacy reasons.

For navigation I ended up using a mixture of command buttons and a command bar with a web page feel.

We created “Data Views” that are basically Pivot Tables. I believe that most users find it difficult to navigate Pivot Tables (adding new fields and moving them around) so the “Data Views” are specified based on function (eg Marketing people are interested in products with measures like sales value and gross profit whilst Sales people are more interested salesmen, customers … and often want to work with sales units not sales value). The “Graphic Representations” are Pivot Charts.

The big difference with these pivot tables and charts is that they are based on a dataset of millions of records and each field can contain a hierarchy of fields.

Excel User Interface

A Data View example (A very simple one)

Simple Data View Example

Graphical Representation example

Graphical Representation

OWC Pivot Chart example

For any description you find in a pivot table (a product, a customer , a salesman ….) you can select it and click the Analyse button. A form appears with a series of Office Web Component Pivot Charts based on your selection (active cell).

OWC Pivot Charts

All the charts are dynamic and based on the same item (in this case, a particular salesman). A double click on the OWC chart brings it up to fill the form and another double click pushes it back. One great feature of the OWC Pivot Chart is that it has “drill down” so you can select an item and drill into the details depending on how you set up the series.