Roger D. Hosto Jr.
Database, LAMP, LAPP, and LAMJ Professional
CMDBA, Network+, i-Net+
Managing Data Flow for Hadoop with Falcon ~ May 23rd 2013 11:13:13
There is a new project in the Apache incubator called Falcon.
Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
Falcon will enable easy data management on Hadoop via a declarative mechanism. Users of the Falcon platform simply define infrastructure endpoints, data sets, and processing rules declaratively. These declarative configurations are expressed in such a way that the dependencies between the configured entities are explicitly described, and that knowledge of inter-dependencies is what allows Falcon to orchestrate and manage the various data management functions.
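To make the declarative approach concrete, here is a minimal sketch of onboarding entities through the Falcon CLI. The XML file names are hypothetical, and since Falcon is still incubating the exact command syntax may change:

bash$ falcon entity -type cluster -submit -file primary-cluster.xml
bash$ falcon entity -type feed -submit -file clicks-feed.xml
bash$ falcon entity -type process -submit -file clean-clicks-process.xml

Once the entities are submitted, Falcon works out the dependencies between them and drives the pipeline accordingly.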
The key use cases that Falcon addresses are data motion, coordination of data pipelines, data lifecycle management, and data discovery.
With these features, users can onboard their data sets with a comprehensive and holistic understanding of how, when, and where their data is managed across its lifecycle. Complex functions such as retrying failures, identifying possible SLA breaches, or automatically handling input data changes become simple directives. All administrative and user-level functions are available via RESTful APIs; the CLI is simply a wrapper over those APIs.
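Because the CLI is just a wrapper, the same operations should be reachable over plain HTTP. As a rough sketch (the host, port, and endpoint paths here are assumptions, not confirmed against the incubating codebase):

bash$ curl "http://falcon-host:15000/api/entities/list/feed"
bash$ curl "http://falcon-host:15000/api/entities/status/process/clean-clicks-process"

The first call would list the defined feed entities; the second would report the status of a single process.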
This seems to be a very interesting project with a lot of potential. For more information, check it out at http://wiki.apache.org/incubator/FalconProposal
Big Data could generate millions of new jobs ~ May 21st 2013 20:42:13
FORTUNE -- With data analytics now one of the fastest growing fields in IT, it stands to reason that data scientists are in demand. That's great for people with the requisite skills. The problem, according to Peter Sondergaard, a senior vice president at IT research firm Gartner, is that there aren't enough of them.
Hive Stinger Initiative. Hive Queries 100 Times Faster. ~ May 14th 2013 15:33:46
What is Stinger, you ask? Stinger is an initiative to enable Hive to handle human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports. Delivering that without needing to resort to yet another tool to install, maintain, and learn can bring a lot of value to the large community of users with existing Hive skills and investments.
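To put "human-time" in perspective, this is the kind of exploratory query Stinger aims to bring into that 5-30 second range. The clicks table and its columns are invented for illustration:

bash$ hive -e "SELECT page, COUNT(*) AS views
               FROM clicks
               WHERE dt = '2013-05-13'
               GROUP BY page
               ORDER BY views DESC
               LIMIT 10"

Today a scan-and-aggregate query like this runs as a batch MapReduce job; Stinger's goal is to make it fast enough for interactive exploration.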
A diverse group of individuals within the Hive community is collaborating on these efforts, with contributions coming from people at SAP, Microsoft, Facebook, and Hortonworks, among others.
Hadoop to Hadoop Copy ~ March 1st 2013 13:15:21
Recently I needed to copy the contents of one Hadoop cluster to another for geo-redundancy. Thankfully, instead of having to write something to do it, Hadoop supplies a handy tool for the job: DistCp (distributed copy).
DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.
Here is the basic usage:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.
Here is how you can handle multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo
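Since my use case is geo-redundancy, the copy has to be re-run as the source cluster changes. DistCp's -update option helps there: it skips files whose size and checksum already match at the destination, so repeat runs only move what changed. A sketch using the same paths as above:

bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar \
                            hdfs://nn2:8020/bar/foo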
Hortonworks Road Show "Big Business Value from Big Data and Hadoop" ~ September 19th 2012 21:55:59
This morning I went to the Hortonworks Road Show. It wasn't bad. I have to say, out of the Hadoop vendors I have talked to, I like Hortonworks' business model the best.