Managing Data Flow for Hadoop with Falcon. ~ May 23rd 2013 11:13:13

There is a new project in the Apache incubator called Falcon.

Abstract

Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters. 

Proposal

Falcon will enable easy data management via a declarative mechanism for Hadoop. Users of the Falcon platform simply define infrastructure endpoints, data sets and processing rules declaratively. These declarative configurations are expressed in such a way that the dependencies between the configured entities are explicitly described. This information about inter-dependencies between the various entities allows Falcon to orchestrate and manage various data management functions.

The key use cases that Falcon addresses are:

  • Data Motion
  • Process orchestration and scheduling
  • Policy-based Lifecycle Management
  • Data Discovery
  • Operability/Usability

With these features it is possible for users to onboard their data sets with a comprehensive and holistic understanding of how, when and where their data is managed across its lifecycle. Complex functions such as retrying failures, identifying possible SLA breaches or automated handling of input data changes are now simple directives. All the administrative functions and user-level functions are available via RESTful APIs. The CLI is simply a wrapper over the RESTful APIs.

 

This seems to be a very interesting project with a lot of potential. For more information, check it out at http://wiki.apache.org/incubator/FalconProposal




Big Data could generate millions of new jobs ~ May 21st 2013 20:42:13

FORTUNE -- With data analytics now one of the fastest growing fields in IT, it stands to reason that data scientists are in demand. That's great for people with the requisite skills. The problem, according to Peter Sondergaard, a senior vice president at IT research firm Gartner, is that there aren't enough of them.

 

http://management.fortune.cnn.com/2013/05/21/big-data-jobs-2




Hive Stringer Initiative. Hive Queries 100 Times Faster. ~ May 14th 2013 15:33:46

What is Stringer, you ask? Stringer is an initiative to enable Hive human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports. Supporting these without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

A diverse group of individuals within the Hive community is collaborating on these efforts, including contributors from SAP, Microsoft, Facebook and Hortonworks.




Hadoop to Hadoop Copy ~ March 1st 2013 13:15:21

Recently I needed to copy the contents of one Hadoop cluster to another for geo-redundancy. Thankfully, instead of having to write something myself, Hadoop supplies a handy tool for the job: DistCp (distributed copy).

 

DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

 

Here is the basic usage:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo

 

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.

 

Here is how you can handle multiple source directories on the command line:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo
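
Since DistCp expects absolute paths, it can help to check them before kicking off a long copy. Here is a minimal sketch of a wrapper that does that. The function names is_absolute and build_distcp_cmd are my own and are not part of Hadoop; the script only prints the distcp command rather than running it, so it works on a machine without Hadoop installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop): validate paths, then print
# the distcp command for one or more sources and a single destination.

is_absolute() {
  # Treat a full URI (scheme://...) or a path starting with "/" as absolute.
  case "$1" in
    *://*|/*) return 0 ;;
    *)        return 1 ;;
  esac
}

build_distcp_cmd() {
  # Need at least one source plus the destination.
  if [ "$#" -lt 2 ]; then
    echo "usage: build_distcp_cmd SRC... DEST" >&2
    return 2
  fi
  local p
  for p in "$@"; do
    if ! is_absolute "$p"; then
      echo "not an absolute path: $p" >&2
      return 1
    fi
  done
  # The last argument is the destination; everything before it is a source.
  echo "hadoop distcp $*"
}

# Print the command for the two-source copy shown above.
build_distcp_cmd hdfs://nn1:8020/foo/a \
                 hdfs://nn1:8020/foo/b \
                 hdfs://nn2:8020/bar/foo
```

Once the printed command looks right, you can run it by hand. A relative path such as foo/a makes the function return non-zero and print an error instead of a command.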

 

 

 




Hortonworks Road Show "Big Business Value from Big Data and Hadoop" ~ September 19th 2012 21:55:59

This morning I went to the Hortonworks Road Show. It wasn't bad. I have to say, out of the Hadoop vendors I have talked to, I like Hortonworks' business model the best.
 
The fact that they are a large committer to the Apache Hadoop project, along with several other sub-projects such as Apache Ambari, doesn't hurt. They seem to be more community-based than the others. If you have a chance, or know someone who would like a good introduction to Hadoop, I would recommend going.

http://info.hortonworks.com/RoadShowFall2012.html?mktotrk=roadshow

--Peace

 






