Notes on Microsoft SQL Server

I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.

Subjects I’ll mention include:

Hadoop
Parallel Data Warehouse
PolyBase
Columnar data management
In-memory data management (Hekaton)

One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.

Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.

But first we need some housekeeping. As best I understood Microsoft’s lingo:

Microsoft talks about selling in three form factors, collectively “ABC”:
- A = Appliance, which is how PDW (Parallel Data Warehouse, née DATAllegro) is sold, in partnership with either Dell or HP.
- B = Box, which is a catchy word for “software”.
- C = Cloud*
Names of major releases go with years — SQL Server 2005, 2008, 2012.
- Timing on the next major SQL Server release hasn’t been disclosed yet …
- … but hopefully will be clarified in the first half of 2013.
- In the meantime, it’s safe to say that it’s a small number of years away, not a small number of quarters.
Point releases of SQL Server are called “Service Packs”, and Service Pack 1 for SQL Server 2012 is now generally available.
Public betas for Azure are called “preview”, and that lingo has slipped into other form factors as well.
Microsoft’s Hadoop efforts are called HDInsight, across at least the Box and Cloud form factors.

*I.e. Azure; pay no attention to dictionaries and poets, who say that skies are azure, while clouds are puffy white.

Microsoft’s Hadoop/HDInsight story starts with what you’d expect:

You can get it in the cloud or on-premises.
Hortonworks did a lot of the work.
Microsoft does Tier 1 support; Hortonworks does Tiers 2 & 3.

The first level of HDInsight management tools will be based on Ambari and donated back to Apache open source, but you might want to integrate the use of those with Microsoft’s long-standing proprietary management suites.

Notes on SQL Server Parallel Data Warehouse include:

PDW apparently has real reference customers and so on.
PDW now uses DAS (Direct-Attached Storage) and the like, versus a previous strategy of simulating shared-nothing on a SAN (Storage Area Network).

What sounds like it might be cool is PolyBase, a PDW extension comparable to Hadapt or Teradata Aster SQL-H. Notes on that start:

Amusingly, PolyBase was developed in the lab of famed MapReduce skeptic Dave DeWitt.
PolyBase development has been underway for around 18 months.
PolyBase will ship with the next release of PDW, scheduled for the first half of 2013.

Technically, I gather:

It has or is a “new query processor” for PDW.
HDFS (Hadoop Distributed File System) now will look like an external table to SQL Server.
SQL Server’s query planner/cost-based optimizer has the choice of either pulling data from HDFS into SQL Server, or kicking off MapReduce jobs straight in Hadoop.
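
To make that last choice concrete, here is a toy sketch of how a cost-based optimizer might decide between importing HDFS data and pushing work down to MapReduce. Everything in it is an assumption for illustration: the `ExternalScan` type, the cost constants, and the cost formulas are hypothetical and are not taken from Microsoft or from PolyBase itself.

```python
from dataclasses import dataclass


@dataclass
class ExternalScan:
    """Hypothetical description of a query fragment over an HDFS-backed external table."""
    input_bytes: int    # raw bytes the fragment must read in HDFS
    selectivity: float  # fraction of bytes that survive filtering/aggregation


# Illustrative constants only; not measured and not from any vendor.
TRANSFER_COST_PER_BYTE = 1.0   # moving a byte from HDFS into the DBMS
DBMS_COST_PER_BYTE = 0.1       # processing a byte inside the DBMS
MAPREDUCE_COST_PER_BYTE = 0.5  # processing a byte in a MapReduce job
MAPREDUCE_STARTUP_COST = 5e8   # fixed overhead of launching a job


def import_cost(scan: ExternalScan) -> float:
    """Cost of pulling all the raw data into the DBMS and processing it there."""
    return scan.input_bytes * (TRANSFER_COST_PER_BYTE + DBMS_COST_PER_BYTE)


def pushdown_cost(scan: ExternalScan) -> float:
    """Cost of reducing the data in Hadoop first, then importing only the result."""
    result_bytes = scan.input_bytes * scan.selectivity
    return (MAPREDUCE_STARTUP_COST
            + scan.input_bytes * MAPREDUCE_COST_PER_BYTE
            + result_bytes * (TRANSFER_COST_PER_BYTE + DBMS_COST_PER_BYTE))


def choose_plan(scan: ExternalScan) -> str:
    """Pick the cheaper of the two strategies under this toy cost model."""
    return "pushdown" if pushdown_cost(scan) < import_cost(scan) else "import"
```

Under these made-up numbers, a small table is simply imported because the job startup cost dominates, while a large table with a highly selective filter is worth reducing in Hadoop before anything crosses the wire.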

I didn’t ask whether HDFS and SQL Server live on the same nodes, à la Hadapt, or different ones, à la Teradata Aster, but I’m guessing the latter, based on Microsoft’s PolyBase page.

And by the way — if SQL Server has significant analytic platform capabilities, nobody’s ever briefed me on them. To the extent it doesn’t, PolyBase/Hadoop might evolve into a partial substitute.

Microsoft SQL Server has for a while had a columnar capability, kludged from its indexing system. The big limitations were:

The column store was read-only.
You had to have a row-organized version of the data sitting around somewhere.

Both those restrictions are being lifted — initially just in PDW appliances, but later in the “box” products as well. Naturally, Microsoft reports that compression is great, calling it “10X” just like the other cool columnar kids now do. At one point there were hasty mentions of “vector processing” and something that sounds like Netezza zone maps, but I didn’t get details of either.

Actually, I suspect there’s a bit of kludge left in there somewhere, as the no-row-based-version feature is “optional”, and the column store is being described as a “clustered index”.
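
Since I didn’t get details, here is only a generic sketch of the zone-map idea mentioned above, not a description of what Microsoft built. The assumption is the usual one: each block of a column keeps its minimum and maximum value, so a range scan can skip blocks that cannot contain a match. The names `Block`, `build_blocks` and `scan_range` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One block of a column, plus the min/max metadata a zone map would keep."""
    values: list
    lo: int
    hi: int


def build_blocks(column, block_size):
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, lo, hi):
    """Return (matching values, number of blocks actually read) for lo <= v <= hi."""
    matches = []
    blocks_read = 0
    for block in blocks:
        # Skip any block whose [min, max] range cannot overlap the predicate.
        if block.hi < lo or block.lo > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read
```

The payoff depends on the data being at least roughly ordered on the filtered column; on randomly ordered data most blocks span the whole value range and nothing gets skipped.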

That takes us to Hekaton, which is already in “preview” with about 100 customers even though it won’t be generally available until the next major SQL Server release a few years out. As on other subjects, I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned was the absence of locking and latching.

A key point is that you only have to move some of your tables into Hekaton; you can manage the rest on disk as you always did. This may be regarded as somewhere in between storage tiering and full federation, in that SQL Server is one DBMS, but can invoke several very different storage engines.
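
The “one DBMS, several storage engines” idea can be sketched as a per-table choice of engine behind a single front end. This is a deliberately simplified illustration with hypothetical class names; both engines here are plain dictionaries, whereas real disk-based and in-memory engines differ in data structures, durability and concurrency control.

```python
class StorageEngine:
    """Minimal key/row store; a stand-in for a real storage engine."""
    name = "abstract"

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)


class DiskEngine(StorageEngine):
    """Placeholder for a conventional disk-based table store."""
    name = "disk"


class MemoryEngine(StorageEngine):
    """Placeholder for an in-memory table store."""
    name = "memory"


class Database:
    """One front end that routes each table to whichever engine it was created with."""

    def __init__(self):
        self._tables = {}

    def create_table(self, table, in_memory=False):
        self._tables[table] = MemoryEngine() if in_memory else DiskEngine()

    def engine_of(self, table):
        return self._tables[table].name

    def insert(self, table, key, row):
        self._tables[table].put(key, row)

    def lookup(self, table, key):
        return self._tables[table].get(key)


def order_with_customer(db, order_id):
    """A query that touches both engines without the caller needing to know which is which."""
    order = db.lookup("orders", order_id)
    customer = db.lookup("customers", order["customer_id"])
    return {"order_id": order_id, "item": order["item"], "customer": customer["name"]}
```

For example, a hot `orders` table could be created with `in_memory=True` while `customers` stays on disk, and `order_with_customer` reads from both through the same interface.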

And that’s all I have for now. Greater substance may or may not follow.

Related links

A vehement, multi-party debate on SAN versus DAS (2008)
DATAllegro’s one and only production customer (2009)
What Netezza zone maps and range partitioning evolved into (2010)
Andrew Brust’s post on PolyBase (last month)
Microsoft’s blog post on Hekaton (also last month)