Tuesday, November 10, 2015

Microsoft to offer three new ways to store big data on Azure

Microsoft will soon offer three additional ways for enterprises to store data on Azure, making the cloud computing platform more supportive of big data analysis.

Azure will have a data warehouse service, a "data lake" service for storing large amounts of data, and an option for running "elastic" databases that can store sets of data that vary greatly in size, explained Scott Guthrie, Microsoft executive vice president of the cloud and enterprise group, who unveiled these new services at the company's Build 2015 developer conference, held this week in San Francisco.

The Azure SQL Data Warehouse, available later this year, will give organizations a way to store petabytes of data so it can be easily ingested by data analysis software, such as the company's Power BI tool for data visualization, the Azure Data Factory for data orchestration, or the Azure Machine Learning service.

Unlike traditional in-house data warehouse systems, this cloud service can quickly be adjusted to fit the amount of data that actually needs to be stored, Guthrie said. Users can also specify the exact amount of processing power they'll need to analyze the data. The service builds on the massively parallel processing architecture that Microsoft developed for its SQL Server database.

The Azure Data Lake has been designed for those organizations that need to store very large amounts of data, so it can be processed by Hadoop and other "big data" analysis platforms. This service could be most useful for Internet of Things-based systems that may amass large amounts of sensor data.

"It allows you to store literally an infinite amount of data, and it allows you to keep data in its original form," Guthrie said. The Data Lake uses Hadoop Distributed File System (HDFS), so it can be deployed by Hadoop or other big data analysis systems.

A preview of the Azure Data Lake will be available later this year.

In addition to these two new products, the company has updated its Azure SQL Database service so customers can pool their Azure cloud databases to reduce storage costs and prepare for bursts of database activity.

"It allows you to manage lots of databases at lower cost," Guthrie said. "You can maintain completely isolated databases, but allows you to aggregate all of the resources necessary to run those databases."

The new service would be particularly useful for running public-facing software services, where the amount of database storage needed can greatly fluctuate. Today, most Software-as-a-Service (SaaS) offerings must over-provision their databases to accommodate the potential peak demand, which can be financially wasteful. The elastic option allows an organization to pool the available storage space for all of its databases in such a way that if one database rapidly grows, it can pull unused space from other databases.

The new elastic pooling feature is now available in preview mode.

Microsoft Azure's new Data Lake architecture.

Wednesday, May 21, 2014

Introduction to Business Intelligence Markup Language (BIML) for SSIS

Problem


With Microsoft SQL Server Integration Services (SSIS), you can build powerful and flexible packages to manage your enterprise-wide ETL solutions. However, every ETL project has some repetitive tasks: importing 20 different flat file exports from the ERP system, loading 15 dimensions into the data warehouse, and so on. With SSIS out of the box, you are forced to create multiple individual packages, which is very time-consuming due to the lack of easy code reuse. In a typical BI project, the ETL implementation can easily take up 60% of the project time, so it is beneficial if the development time of similar SSIS packages can be cut down.

Solution


BIML, which stands for Business Intelligence Markup Language, offers us a solution. With BIML you can easily generate SSIS packages based on metadata, allowing us to effectively apply code reuse, templates and patterns in our ETL solutions. This tip will give a brief introduction to BIML with a small example, but later tips will delve into the powerful capabilities of BIML.

What is BIML?


BIML is a dialect of XML and can be used to specify business intelligence and data warehouse solutions. It is 100% compatible with SSIS and SSAS. When you generate an SSIS package, for example, you can edit it further in Visual Studio, just as if you had created the package manually.

The language is developed by Varigence, but an open-source implementation is available in BIDS Helper, a free add-on for Visual Studio. This means you can easily develop a BIML script inside BIDS and generate SSIS packages directly into your project.

BIML supports SSIS 2005, 2008(R2) and 2012, although not every feature of the project deployment model is currently implemented in BIDS Helper.

How do I start using BIML?


As mentioned in the previous paragraph, you simply install BIDS Helper on your machine, if you haven't already. To add a new BIML file to your project, right-click on the project or the SSIS Packages node in the Solution Explorer and choose Add New BIML File in the context menu.
A file called BIMLScript.biml will be added to the Miscellaneous folder. This BIML file starts out with an empty BIML root element, shown below.
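
For reference, that starting point is essentially just the Biml element with the Varigence namespace (the same namespace used in the examples later in this tip):

========================================================================
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>
========================================================================
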
As you can see in the screenshot above, Visual Studio offers color coding for the BIML, as it is just XML. BIDS Helper also offers some basic IntelliSense features:
When you type "<", you get a list of all the possible child elements for the current element. BIDS Helper inserts the corresponding closing tag when you close an element, and when typing inside an element you get a list of the attributes you can configure.
IntelliSense also reports XML violations - such as not closing an element - as well as required attributes that you forgot to configure.

The BIML can be checked for errors by right-clicking the file and choosing Check BIML for Errors. The compiler doesn't always give the most straightforward error messages, so you might want to check for errors often when developing large BIML files.

When you choose Generate SSIS Packages, BIML will check for errors by validating the XML and BIML syntax, but also by validating the resulting SSIS packages. For example, if a destination table used in a data flow does not already exist, an error will be thrown. Of course, packages are not generated when any error is found during compilation.


When packages have been generated successfully, they will be added to the SSIS Packages node in the project.

A small BIML example


Let's take a look at a small sample BIML file to get a feel for how BIML works. This example will create an SSIS package with two Execute SQL Tasks. The first task will create a table (dropping it first if it already exists); the second task will insert a row into the same table.

First we need to create the connection managers. The following code creates an OLE DB connection manager called OLE_BIML to the database named BIML on the localhost server.
========================================================================
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections >
    <Connection Name ="OLE_BIML"
                ConnectionString="Data Source=.;Initial Catalog=BIML;
                   Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;"/>
  </Connections>
========================================================================
Connections are defined outside any package. Once you refer to a connection in a package, the connection is added to that package when BIML generates it. The next step is to add a package to the BIML script. A small remark: if you are working with SSIS 2005 or 2008, you might want to change the version number of the native client in the connection string, for example SQLNCLI11.1 becomes SQLNCLI10.1 for SSIS 2008.
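
As a hedged sketch of that remark, the same connection manager aimed at SSIS 2008 would differ only in the Provider version (adjust the server and database names to your own environment):

========================================================================
<Connections>
  <Connection Name="OLE_BIML"
              ConnectionString="Data Source=.;Initial Catalog=BIML;
                 Provider=SQLNCLI10.1;Integrated Security=SSPI;Auto Translate=False;"/>
</Connections>
========================================================================
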

The following snippet will create an empty package called HelloWorld.
========================================================================
<Packages>
    <Package Name="HelloWorld" ConstraintMode="Linear">
    </Package>
</Packages>
========================================================================
The constraint mode defines the default behavior of tasks inside the package. When Parallel is specified, no precedence constraints are defined between tasks. With the Linear constraint mode, each task is connected with a Success precedence constraint to the previous task, following the order of specification in the BIML file.
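
To make the difference concrete, here is a minimal sketch (not part of the example package; the package and task names are just illustrative) of a package declared with the Parallel constraint mode. With this setting no precedence constraints are created, so the two tasks below would simply start at the same time, and any ordering between them would have to be added explicitly:

========================================================================
<Packages>
  <Package Name="HelloWorldParallel" ConstraintMode="Parallel">
    <Tasks>
      <!-- No precedence constraint is generated between these two tasks. -->
      <ExecuteSQL Name="(SQL) Task A" ConnectionName="OLE_BIML">
        <DirectInput>SELECT 1;</DirectInput>
      </ExecuteSQL>
      <ExecuteSQL Name="(SQL) Task B" ConnectionName="OLE_BIML">
        <DirectInput>SELECT 2;</DirectInput>
      </ExecuteSQL>
    </Tasks>
  </Package>
</Packages>
========================================================================
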

Let's add the two Execute SQL Tasks to our package. This code will add the first Execute SQL Task, which will check the existence of the destination table and create it if necessary.
========================================================================
<Tasks>
   <ExecuteSQL Name="(SQL) Create or Drop Table" ConnectionName="OLE_BIML">
      <DirectInput>
            IF EXISTS (SELECT * FROM sys.objects WHERE object_id = 
                       OBJECT_ID(N'[dbo].[HelloWorld]') AND type IN (N'U'))
                   DROP TABLE [dbo].[HelloWorld];
            GO
            IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = 
                       OBJECT_ID(N'[dbo].[HelloWorld]') AND type IN (N'U'))
            BEGIN
                   CREATE TABLE [dbo].[HelloWorld](
                          [ID] [int] IDENTITY(1,1) NOT NULL,
                          [Message] [varchar](50) NOT NULL,
                          [Timestamp] [datetime2](3) NOT NULL
                   );
            END
            GO
      </DirectInput>
   </ExecuteSQL>
========================================================================
The Execute SQL Task has two attributes defined: its name and the connection it will use. The task has a child element DirectInput, specifying the query issued against the database. When specifying a query, watch out for special XML characters, such as the less-than sign and the ampersand. You need to replace these with their escaped counterparts: for example, the ampersand & becomes &amp; and the less-than sign < becomes &lt;.
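
As an illustration, here is a hedged sketch of a task (it would sit inside the Tasks collection) whose query contains a less-than comparison. The &lt; entity stands in for the < character; the 30-day cut-off and the task name are purely illustrative:

========================================================================
<!-- &lt; below is the escaped form of the less-than sign in the T-SQL WHERE clause. -->
<ExecuteSQL Name="(SQL) Delete Old Rows" ConnectionName="OLE_BIML">
  <DirectInput>
    DELETE FROM [dbo].[HelloWorld]
    WHERE [Timestamp] &lt; DATEADD(DAY, -30, SYSDATETIME());
  </DirectInput>
</ExecuteSQL>
========================================================================
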

The second Execute SQL Task is created with similar code:
========================================================================
<ExecuteSQL Name="(SQL) Insert Hello World" ConnectionName="OLE_BIML">
      <DirectInput>
     INSERT INTO [dbo].[HelloWorld] VALUES ('Hello World!',SYSDATETIME());
      </DirectInput>
   </ExecuteSQL>
</Tasks>
========================================================================
And that's all the code we need to generate our package. The full script - with comments added - looks like this:
========================================================================
<!-- BIML script created by Koen Verbeeck -->
<!-- This BIML code will generate a simple SSIS package with two Execute SQL Tasks -->
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <!-- First define the connection. This is defined outside the packages.
       Add CreateInProject="true" for project connection managers.-->
  <Connections >
    <Connection Name ="OLE_BIML"
                ConnectionString="Data Source=.;Initial Catalog=BIML;
                Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;"/>
  </Connections>
  <Packages>
    <!-- Linear automatically connects all tasks in order of creation.
         Parallel doesn't create precedence constraints. -->
    <!-- Using SSIS 2012, projects also have protection levels.
         Those of the packages must match the protection level of the project. -->
    <!-- DontSaveSensitive is the default package protection level (as it should be). -->
    <Package Name="HelloWorld" ConstraintMode="Linear">
      <Tasks>
        <!-- When a connection is used in a task, the connection manager is added
             to the package. -->
        <ExecuteSQL Name="(SQL) Create or Drop Table" ConnectionName="OLE_BIML">
          <DirectInput>
            IF EXISTS (SELECT * FROM sys.objects WHERE object_id = 
                           OBJECT_ID(N'[dbo].[HelloWorld]') AND type IN (N'U'))
                   DROP TABLE [dbo].[HelloWorld];
            GO
            IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id =
                            OBJECT_ID(N'[dbo].[HelloWorld]') AND type IN (N'U'))
            BEGIN
                   CREATE TABLE [dbo].[HelloWorld](
                          [ID] [int] IDENTITY(1,1) NOT NULL,
                          [Message] [varchar](50) NOT NULL,
                          [Timestamp] [datetime2](3) NOT NULL
                   );
            END
            GO
          </DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="(SQL) Insert Hello World" ConnectionName="OLE_BIML">
          <DirectInput>
            INSERT INTO [dbo].[HelloWorld] VALUES ('Hello World!',SYSDATETIME());
          </DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>
========================================================================
After generating the package, we can admire the result:
The package can now be edited like any other SSIS package. You can validate the package by running it and checking the result in SQL Server.
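
If you want to verify the result from SQL Server, a quick query along these lines should do (assuming you kept the BIML database and the dbo.HelloWorld table from this example):

========================================================================
-- Run against the BIML database after executing the package;
-- each successful run should have added one 'Hello World!' row.
SELECT [ID], [Message], [Timestamp]
FROM [dbo].[HelloWorld]
ORDER BY [ID];
========================================================================
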

Conclusion


This tip gave a basic introduction to the BIML language, which you can use to generate SSIS packages. Future tips will dig deeper into BIML to expose its powerful capabilities and will show you how you can use metadata to generate SSIS packages on the fly.

Try BIML out for yourself. You can use the code from this tip to create your own simple BIML script. The script uses a database called BIML, so you can either create such a database or change the connection string to work with another database.


Thursday, May 15, 2014

Business Intelligence Enhancements in SQL Server 2014



Microsoft has been pouring R&D resources into building out its business intelligence (BI) feature set and the upcoming SQL Server 2014 (SQL 2014) release will continue that trend. The new release includes enhancements to make data exploration easier, improvements in BI semantic modeling, new offerings to help build and support massive databases and data warehouses, and tools to ensure the quality and consistency of data. Here’s what SQL 2014 will do for you with regards to business intelligence:

  1. Data Exploration Enhancements in SQL Server 2014: Microsoft has more than 300 million users who think of Excel when they think about manipulating data, so it only makes sense to reinforce the role of Excel, and more broadly Microsoft Office, as the rich and familiar front end for data manipulation, exploration, and visualization against a SQL Server back end. Microsoft’s PowerPivot add-in for Excel makes accessing and analyzing data very easy for end users, and the new Power View add-in for Excel adds powerful new means of visualizing data, wherever it resides. Other new tools include Power Map (formerly known as Project GeoFlow) for better mapping and geographic data integration, and Project Data Explorer for importing data into Excel from heterogeneous data sources.
  2. BI Semantic Model in SQL Server 2014: Microsoft’s improvements to the BI Semantic Model (BISM) give users new ways to build out BI solutions that scale from small, single-person usage to huge Fortune 500 organizations, with a focus on credible and consistent data.
  3. Big Data: Big data gets even easier in SQL 2014. There are many new offerings to help build and support massive databases and data warehouses, such as support for up to 15,000 partitions in a data store and up to 640 logical cores on high-end database servers. In addition, Microsoft has fully embraced Hadoop in the form of HDInsight, on Windows Azure and Windows Server, to take advantage of unstructured data and the parallel computational approach common to Hadoop applications. PolyBase, also new in SQL 2014, is a feature of the SQL Server Parallel Data Warehouse (PDW) that makes combining non-relational data and traditional relational data an easy and swift process.
  4. Enterprise Information Management (EIM): Enterprises need help controlling the spread of data silos and ensuring the quality and consistency of data, and Microsoft has introduced or enhanced several tools to serve this requirement. The Data Quality Services (DQS) tools help enterprises and data stewards manage data quality end to end by building a knowledge base of data-quality topics. Master Data Services (MDS) adds new features, such as an MDS add-in for Excel, for mapping objects and reference data and for controlling dimensions and hierarchies of data.

That’s a lot of new features and tools for the data analysis and insight side of business intelligence. The focus, in my opinion, is on enabling end users to gather, analyze, and maintain their own business intelligence.

Tuesday, May 13, 2014

Get data from a Web site with Excel, Power Query and Kimono

As you are surely aware, Power Query is an Excel add-in published by Microsoft as part of its new Power BI technology, dedicated to importing data from multiple sources.


While many of us see in Power Query a tool with exceptional potential, there is one area where it is still rather weak: importing data displayed on a Web page. In fairness, it is not really Power Query that is weak; rather, after 2,500 lines of JavaScript, the final HTML of many sites is often simply not usable…


To get around the problem there is a fantastic tool called Kimono. With Kimono, you can generate an API from a Web site in a completely graphical way, directly from your browser.
To illustrate, we will use data from the Metacritic Web site, which aggregates the scores given in the press (print or Web) to video games.

We will start by trying to connect directly to Metacritic via Power Query, just to see how unusable the result is:

1 – I launch Excel, open the Power Query menu, choose the From Web option, and enter the address of the source site:


2 – Oops, Power Query's navigator cannot sort this out on its own, so we will click Edit:

3 – And boom, good luck! If somebody has found a method, I'm interested; personally, this is usually where I give up…

So let's switch to Kimono (after signing up, we add the bookmarklet to the bookmarks bar), which will let us create the API graphically. Let's start!

1 – I open my browser, go to the source site and click kimonify (the bookmarklet). The Kimono toolbar appears at the top of the page, and I start by selecting the names of the games simply by clicking on them in the Metacritic page. Kimono's engine then recognizes the HTML attribute and identifies the 88 names; I rename the field "Name of the game":

2 – I click + in the Kimono toolbar, select the first score (92 for DS2), then the second (88 for Hearthstone), and again Kimono identifies the 88 matching values. I rename the field "Notes":

3 – I repeat the operation with the User Score. This time Kimono's engine hesitates a little and proposes several series; by selecting additional scores and rejecting other attributes (as well as the "User Score" label itself), it ends up finding 85 values (indeed, 3 games do not have this score, but that does not break the recognition):

4 – I have enough information for the moment, so I confirm (Done in the Kimono toolbar), give my API a name, and set a refresh interval (real time for the demonstration, but you can reduce the load on the source by refreshing the data set only periodically):

5 – Kimono then takes me to my API's dashboard:

6 – Via the "How To Use" menu I find everything needed to reach the API, including the endpoint URLs that I will be able to hand to Power Query in JSON, CSV or RSS format:

7 – We can test the CSV endpoint (yes, I am old school) right away:

8 – But it's better to call it directly from Power Query (From Web, as always):

9 – And after a little shaping (delete the first row, split the columns by delimiter, promote the first row to column headers, clean up User Score, filter out the junk rows), we obtain a proper dataset. Note: if you choose JSON and it doesn't work, the API is probably malformed; the best option is to delete it and start again (Kimono is still in beta).

10 – And we obtain the expected data in Excel:

Personally, I find that just fantastic! And it's all live, both the query in Excel and the API on the Kimono side, so a simple refresh is enough to get the updated data from the source.

Is Microsoft Power BI a Game Changer?

At its 2013 Worldwide Partner Conference (WPC) Microsoft made its latest Business Intelligence (BI) product introduction, Power BI. 
Adding more BI functionality to the Excel 2013 client application was certainly anticipated. We’ve already seen enterprise features previously delivered in Microsoft’s server-based products re-envisioned as self-service Excel features, and the addition of Power Query and Power Map to Excel marks continued progress toward completing the BI tool set within Excel.
But Power BI takes Microsoft's popular Excel-based self-service BI features to a whole new level by reconnecting these features to enterprise BI/DW data and adding a compelling search and data exploration experience. It's truly innovative, and will potentially redefine self-service Business Intelligence.

What is Power BI?


The key functionality delivered by Power BI comes from a number of integrated components, all stitched together using Office 365:
Power Pivot. Power Pivot provides end-user-accessible, in-memory data modeling for large data sets. Power Pivot was introduced as an add-in to Excel 2010, and has since been fully integrated as an out-of-the-box feature in Excel 2013.
Power View. Power View is a complementary technology to Power Pivot, enabling advanced visualizations for data models created in Power Pivot. Power View delivers interactive visualizations, including animated visuals and maps powered by Bing Maps. Originally Power View was available only as a SharePoint feature, but it has since been integrated directly into Excel 2013.
Power Map. Power Map, previously known by the development name GeoFlow, is an add-in to Excel 2013 that provides still more compelling Bing Maps-powered visualizations, extending Power View’s capabilities with 3D map visualizations.
Mobile BI Viewer. Microsoft announced (but has not fully detailed) native mobile viewing applications for visualizations created in Power View and published to Power BI sites. The mobile components will meet the growing requirement to deliver BI visualizations to tablet devices.
Power Query. Power Query, previously known by the development name Data Explorer, is an add-in to Excel 2013 that provides a more fluid, open data discovery environment than Power Pivot provides alone. Despite its unexciting name, we believe Power Query, together with its catalog and on-premises data integration, will do more to enable innovative self-service BI solutions than any end-user technology we’ve seen to date.
Data Management Gateway. While the publishing and consumption experience for Power BI is in the cloud, Microsoft has recognized that much of the underlying data users will explore will still be on premises. Power BI's architecture includes a Data Management Gateway facility to allow users to discover, explore and incorporate on-premises data within the cloud publishing infrastructure.
Q&A. Q&A is the most exciting of all the end-user technologies announced at WPC—by far. Just as Google’s search engine fundamentally changed the way the world discovers and interacts with content on the Internet, Q&A will completely change how users discover and navigate their organization’s rapidly growing, yet often abstruse, mountain of data.

Hybrid Architecture


Power BI will be delivered as a hybrid of on-premises and cloud components. Consistent with Microsoft’s Office-centric BI strategy, data modeling and solution authoring will be done using Excel 2013 on Windows desktops. Publishing will be done using Office 365’s SharePoint service—requiring end users to have only a web browser (or a tablet device for the native mobile apps).
This hybrid architecture is a huge win with BI content authors who are comfortable with Excel and value the advanced functionality only a full desktop application delivers. But it may be less welcome to organizations that standardize on “web browser only” architectures, or those that need to deliver full functionality to non-Windows users.

Is a cloud-only product a good idea?


Most controversial will be the decision to deliver Power BI exclusively via Microsoft’s cloud Office platform, Office 365. Few would dispute that cloud platforms are an accepted macro trend in corporate IT—case in point: enterprise adoption of Salesforce.com, Google Docs, AWS, Dynamics CRM, Windows Azure, and Office 365. As detailed in Microsoft’s most recent quarterly earnings report, cloud-delivered solutions are growing much faster than its traditional products.
Yet every cloud provider offering a cloud alternative to on-premises software initially encounters resistance from business customers more accustomed to solutions deployed within their own computing environments. Power BI will have its skeptics.
Power BI will be a quick win with customers who need “big company BI” without the capital investment and complexity that accompanies it. Other customers will initially resist moving their BI systems to Microsoft’s cloud for a variety of reasons. 
While Power BI’s SaaS strategy will postpone adoption for some customers, it’s the right one for Microsoft and its customers. Why?
  • Reduced Complexity. Power BI will be compelling primarily because of the array of integrated functionality it provides. Provisioning such a comprehensive system on-premises would take most customers weeks or months. As with Office 365 and Windows Azure, Power BI instances will deploy in days, not weeks, and future infrastructure upgrades will be left to Microsoft, freeing in-house resources to worry only about the application layer.
  • Pace of Evolution. The BI/Data Analytics space is rapidly changing. Mobility, Big Data and traditional Business Intelligence are quickly merging to create a new kind of insight & analytics landscape.  Cloud-based systems can iterate and improve faster than traditional on-premises alternatives.  Ultimately, adopting a cloud-deployed system will benefit customers by providing more modern, up-to-date systems that keep up with their evolving BI/Analytics needs.
  • Broadened Adoption. By removing the need to make large up-front infrastructure investments, Power BI will make the decision to use the technology much easier for more customers--ultimately broadening adoption of advanced BI capabilities.

How will Power BI change the game?


Through its data catalog and Data Management Gateway, Power BI merges centralized DW/BI strategies with cloud data sources and self-service tools like no other platform has yet been able to do. And as a SaaS solution, all of its functionality is packaged so that any organization can easily implement and use it. In short, Power BI has the potential to accelerate the adoption curve and make advanced BI analytics as common as Excel itself.
Power BI brings game-changing capabilities to all BI system stakeholders:
  • End Users. End users have long wanted the ability to easily access and analyze data. Q&A’s natural-language query and intelligent visualization engine will do more than any tool before it to encourage every employee to use information as a basis for decisions.
  • BI Analysts. Power Pivot gave power users the ability to build efficient data models over large data volumes, but that information needed to “fit” into clean data models. Power Query’s data transformation capabilities and flexibility further empower analysts to build their own end-to-end solutions quickly.
  • IT & Data Stewards. Self-service BI has always been a push-pull topic. Everyone wants users to have the tools and information they need to make good decisions, yet proper governance, security and auditing are often seen by power users as roadblocks to progress. The data catalog underlying Power BI is exciting because it promotes self-service and governance simultaneously.
Not every organization will move quickly to adopt Power BI. For some, it will represent too much strategic change too quickly.  For others there may truly be regulatory or other reasons that preclude consideration of cloud-based systems. 
But for many organizations, establishing an affordable and robust BI platform has been challenging and cost-prohibitive.  Power BI may be the solution that finally makes advanced BI as simple to use as a search engine, and as broadly adopted.

Monday, May 12, 2014

Microsoft’s Power BI For Office 365 Comes Out Of Preview, Simplifies Data Analysis And Visualizations

After more than half a year in limited preview, Microsoft today launched Power BI for Office 365, its business intelligence service for Office 365 Enterprise subscribers. In combination with Excel, Power BI allows users to model and analyze their data and query large datasets with complex natural-language queries. The tool also allows users to easily visualize their data in Excel with the help of Power View and Power Map.


As Eron Kelly, Microsoft’s general manager for SQL Server product marketing, explained, most business users feel so comfortable using Excel that the company decided not to build a new dashboard for Power BI. “We want to make use of the ubiquity of Excel,” he said, and Power BI will give far more users access to critical business intelligence tools than ever before.
Instead of having to go to IT and ask for data, wait for it, and then probably ask IT to slice the data in a different way again, tools like Power BI allow for self-service business intelligence. The barrier to entry is relatively low, and as the volume of available data increases and companies look to make their processes more data-driven, Microsoft hopes that Power BI will become the go-to choice for enterprises.

Despite the self-service model, though, IT still remains in control of what data is shared. Microsoft allows IT admins to set up data sources for their users, but what makes the tool especially powerful is that this data can be mixed with publicly available data. That data can come from Microsoft’s Azure Marketplace, and for advanced users the tool will also allow querying any REST API or service that uses OData.


The stand-out feature of Power BI, however, is its Q&A tool. This allows you to ask a question about your data in natural language. Say you are looking at public data about New York you’ve pulled in from the city’s open APIs. With Power BI, you can now ask for “noise complaints on Superbowl Sunday compared to an average day” and the tool will happily give you the answer. The technology behind this feature is based on a combination of work done by Microsoft Research and the Bing team.

As a part of Office 365, Power BI is only available through a subscription and only available as an add-on to Office 365 Enterprise customers.

MSBI: Basics of SSIS and Creating Packages #1

In this article, we will see what SQL Server Integration Services (SSIS) is: the basics of what SSIS is used for, how to create an SSIS package, and how to debug it.

SSIS AND DTS OVERVIEW


SSIS is an ETL (Extract, Transform and Load) tool that is very much needed for data warehousing applications. SSIS is also used to perform operations such as loading data based on need, performing different transformations on the data such as calculations (sum, average, etc.), and defining a workflow of how the process should flow and which tasks to perform as part of day-to-day activity.

Prior to SSIS, Data Transformation Services (DTS) in SQL Server 2000 performed these tasks with fewer features. With the introduction of SSIS in SQL Server 2005, many new features became available. To develop an SSIS package, you need SQL Server Business Intelligence Development Studio, which is available as a client tool when installing SQL Server Management Studio (SSMS).

SSMS AND BIDS


SSMS provides different options for developing your SSIS package, starting with the Import and Export Wizard, with which you can copy data from one server to another or from one data source to another. With this wizard, we can define a structure for how the data flow should happen, save it as a package, and deploy it to execute in any environment we need.

Business Intelligence Development Studio (BIDS) is the tool used to develop SSIS packages. BIDS ships with SQL Server as an interface that lets developers build the workflow of the process step by step. Once BIDS has been installed with the SQL Server installation, we can locate it and start our process as shown in the steps below.

STEPS


Now let’s take an example of importing data from a text file into a SQL Server database using SSIS. Here is the step-by-step process for achieving this task with SSIS.

Step 1 – Go to Start -> Programs -> Microsoft SQL Server 2008 R2 -> SQL Server Business Intelligence Development Studio, as shown in the below figure:

It will open BIDS as shown in the below figure. This is similar to the Visual Studio IDE, where we normally create startup projects based on our requirements.

Step 2 – Once BIDS is open, we need to create a solution based on our requirement. Since we are going to work with Integration Services, go to File -> New Project (or press Ctrl + Shift + N). A dialog opens where we need to select Integration Services Project and give the project a name, as shown in the below screen:

After creating this new project, we can see that a new SSIS package (Package.dtsx) is added by default when you create an Integration Services Project; you can right-click on it and rename it. The designer now shows the empty workflow tabs as shown below:

Step 3 – Since our task is to load the text file into the database, we need to create a new connection, or in simple words a data source. Right-click on the Connection Managers pane and select "New Flat File Connection" as shown in the below image:

Step 4 – It will open a window as shown in the below image. We need to fill in details such as the path of the text file and the format of the text file. If the first row of the text file contains the column names, then we need to select the "Column names in the first data row" check box, as shown in the below image:

Step 5 – You can see the columns we are trying to import from the text file by going to the Columns tab or the Preview tab, shown in the below image. The Preview tab is useful when a huge amount of data is available in the source, say 1 million records, because it shows only 100 records at a time. To see records 501 to 600, specify "Data rows to skip" = 500 on the Preview tab. Click the OK button to complete this task.

Step 6 – Now that the source section is defined, we follow similar steps to create the destination section. Right-click on the Connection Managers pane and select "New OLEDB Connection". A window pops up asking for the connection details, as shown below. Fill in all the connection details and test the connection.

After clicking OK, we will see the connection details in the Connection Managers pane, as shown in the below screen.

Step 7 – Now we need to create the Data Flow Task that will import the data from the text file into the database. SSIS uses workflow tasks to process the request as a step-by-step process, and most of our work will be done here. In the toolbox on the left pane, click on the "Data Flow Task" item and drag it onto the design surface. You’ll see an icon as shown in the below screen. Double-click on that icon to open it.

Step 8 – Double-clicking the Data Flow Task takes us to the Data Flow tab, where we need to lay out the logic for how the data flows as per our requirement. Since our requirement is to import the text file, drag and drop the "Flat File Source" from the toolbox onto the Data Flow tab. Double-click the Flat File Source task and a window pops up showing the connection we set up at the initial stage, as shown in the below screen:


Notice that there is an option "Retain null values from the source as null values in the data flow." This is disabled by default, which means that null values in the source will be treated as empty strings. You can click on the Preview button to double-check your structure.

Step 9 – Now we need to set up the destination flow; that is, so far we have read the data, and now we need to insert it into the database as per our requirement. Go to the Data Flow Destinations section of the toolbox and drag and drop the "OLE DB Destination". Next, we need to connect the source and the destination to tell SSIS how the mapping works, so click on the Flat File Source and drag the green arrow to the OLE DB Destination, as shown in the below screen:

Step 10 – Double-click on the destination task to configure it. Notice that this also allows you to keep nulls. In our example, we’re going to create a new table, but if you have a table already created, you can select it from the table drop-down menu. I created a new table in my database with the structure shown below before doing the above steps, so I can see the table name in the list as shown in the below screen:
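
The original screenshot with the table definition is not reproduced here. As a purely hypothetical illustration (the table and column names below are made up and should be adjusted to match the columns of your own text file), a destination table for a simple text file import could look like this:

========================================================================
-- Hypothetical destination table; adjust the name and columns
-- to match your flat file before running this in your database.
CREATE TABLE [dbo].[CustomerImport](
    [CustomerID]   [int]          NOT NULL,
    [CustomerName] [varchar](100) NULL,
    [City]         [varchar](50)  NULL
);
========================================================================
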



Step 11 – Now we need to do the mapping, i.e., we are going to map the columns from the source (flat file schema) to the destination (database schema), as shown in the below screen:

Step 12 – Finally, once everything is configured, click the OK button. You should see a screen similar to the one below.

Step 13 – Press F5 to start the execution of the SSIS package. We should see each step turn green as the data is read and written, and you can then see the data that has been imported from the text file into the database.

We will go over the different sections of SSIS in the upcoming articles :)!  See you guys!