
Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger —  but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

  • Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
  • John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
  • ~250 employees.
  • ~70-75 subscription customers.

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

  • 10ish nodes for a typical starting cluster.
  • 100ish nodes for a typical “data lake” committed adoption.
  • Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
  • A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
  • HBase used in >50% of installations.
  • Hive probably even more than that.
  • Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

  • Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
  • Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
  • Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

  • It’s been in preview/release candidate/commercial beta mode for weeks.
  • Q3 is the goal; H2 is the emphatic goal.
  • Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
  • The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year will have passed between when Cloudera started supporting them and when Hortonworks offers Hadoop 2.0.

Hortonworks’ approach to SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include (with a small configuration sketch after the list):

  • Providing a Hive-friendly execution environment in Hadoop 2.0. For example, this seems to be a main point of Tez, although Tez is also meant to support Pig and so on. (Recall the close relationship between Hortonworks and Pig fan Yahoo.)
  • Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet.
  • Improving Hive itself, notably in:
    • SQL functionality.
    • Query planning and optimization.
    • Vectorized execution (Microsoft seems to be helping significantly with that).

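To make the execution-environment point concrete, here is a minimal HiveQL session sketch. The hive.execution.engine and hive.vectorized.execution.enabled settings are the knobs Hive exposes for the Tez and vectorization work described above, though exact names and defaults can vary by release; the web_logs table is purely hypothetical.

    -- Run this session's queries on Tez rather than classic MapReduce.
    SET hive.execution.engine=tez;

    -- Enable vectorized (batch-at-a-time) operators; initially this
    -- applies mainly to ORC-backed tables.
    SET hive.vectorized.execution.enabled=true;

    -- web_logs is a hypothetical ORC table used only for illustration.
    SELECT status_code, COUNT(*)
    FROM web_logs
    GROUP BY status_code;
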
Specific notes include:

  • Some of the Hive improvements — e.g. SQL windowing (see the example after this list), better query planning over MapReduce 1 — came out in May.
  • Others — e.g. Tez port — seem to be coming soon.
  • Yet others — notably a true cost-based optimizer — haven’t even been designed yet.
  • Hive apparently often takes 4-5 seconds to plan a query, with a lot of the problem being slowness in the metadata store. (I hope that that’s already improved in HCatalog, but I didn’t think to ask.) Hortonworks thinks 100 milliseconds would be a better number.
  • Other SQL functionality that got mentioned was UDFs (User Defined Functions) and sub-queries. In general, it sounds as if the Hive community is determined to someday falsify the “Hive supports a distressingly small subset of SQL” complaint.
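
For concreteness, the windowing support that arrived in May looks like standard SQL OVER clauses. A minimal sketch, assuming a hypothetical page_views table with user_id and view_time columns:

    -- Running count of page views per user, ordered by time.
    -- (page_views, user_id, and view_time are invented names.)
    SELECT
      user_id,
      view_time,
      COUNT(*) OVER (PARTITION BY user_id ORDER BY view_time) AS views_so_far
    FROM page_views;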

As for ORC (a table-definition sketch follows the list):

  • ORC manages data in 256 megabyte chunks of rows. Within such chunks, ORC is columnar.
  • Hortonworks asserts that ORC is ahead of Parquet in such areas as indexing and predicate pushdown, and admits a Parquet advantage in only one area — the performance advantage of being written in C.
  • The major contributors to ORC are Hortonworks, Microsoft, and Facebook. There are ~10 contributors in all.
  • ORC has a 2-tiered compression story.
    • “Lightweight” type-specific compression is mandatory, for example:
      • Dictionary/tokenization, for single columns within chunks.
      • Run-length encoding for integers.
    • Block-level compression on top of that is optional, via a collection of usual-suspect algorithms.

Finally, I asked Hortonworks what it sees as a typical or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics included:

  • 2 CPUs x 6 cores = 12 cores.
  • 12 or so disks, usually 2-3 terabytes each. 4 TB disks are beginning to show up in “outlier” cases.
  • Usually 72 gigs or more of RAM. 128 gigs is fairly common. 256 sometimes happens.
  • 10GigE is showing up at some web companies, but Hortonworks groaned a bit about the expense. Hearing that, I didn’t even ask about Infiniband, its use in certain Hadoop appliances notwithstanding.
  • Hortonworks isn’t seeing much solid-state drive adoption yet, some NameNodes excepted. No doubt that’s a cost issue.
  • Hortonworks sees GPUs only for “outlier” cases.
