Doug Cutting, chief architect at Cloudera, is also the father of Hadoop. He delivered the keynote at the Cloudera 2019 event in Mumbai.
How do you undertake a journey to the data cloud? Digital transformation is going on all around us. It is happening across all aspects of society. We are now learning how to integrate new technologies.
Change has accelerated in the past decade. Earlier, systems were deployed with the expectation that they would last forever. They were not designed to look at each other's data, and were fairly limiting. Open source was a new idea in the early 2000s. People began to adopt Lucene, software that Doug himself had written, even though it had no institutional backing or publicity. Open source emerged as a tool for development.
Nutch started in 2003. Around 2005, Google published a paper on how they build search engines. They had a paper talking about how they had automated things. We started reworking Nutch in 2004. The tale of debugging is much longer. In 2006, I joined Yahoo! and developed Hadoop. Hadoop was named after my son's toy elephant. It was a distributed computing platform, based on Google's ideas.
A group of people believed that Hadoop could be taken much further. Together, they formed Cloudera. I joined Cloudera in 2009. Stepping back, my lesson from Hadoop is that if you increase the scale and focus on flexibility, you permit people to store more data in raw form and experiment. They can innovate more quickly. The waterfall method inhibited progress with data. Hadoop gave us a much more appropriate platform. Most of the past data was relational.
New sources of data are events, things recorded from sensors, etc. We need a different class of tools. Companies can run petabytes of data easily today. Software is also eating the world. In every industry, everywhere, the advances being made predominantly use software. Today, a company’s growth is fuelled more by data. The use of data is no longer isolated. It has emerged everywhere.
There are some challenges. First, there is a new set of technologies. There are over 10 open source technologies going around in this space. If you have an idea or an inspiration, it is fairly easy to figure out what technologies will work in future. The larger challenges are institutional or cultural. Using data changes an organization’s structure. Much of the data that we now have concerns people. We haven’t done a good job so far in managing people’s rights. If we want to keep ourselves from being regulated, we need to be even more responsible. Data ethics needs our full attention.
The journey has been interesting so far. The best way to start is small. Start with a project that has a high probability of success. It will provide you with insights that you never had before. You need to be motivated by concrete problems. IT should also permit experimentation. You need to put all the right mechanisms in place in order to transform. As we improve our capabilities, we will also keep incorporating new technologies, systems, etc. We also need to create systems in which we can expect change. It is going to be competitive in the future.
Analytics is fundamentally about counting. Data science is fancy counting, with some fancy mathematics added. You can look at the data and learn what you think is appropriate. ML is also fancy counting, with feedback. You need to understand the trade-offs among these.
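The "counting" framing can be made concrete with a small Python sketch. It is only an illustration, not anything shown in the keynote: the event log, the `frequencies` helper and the `CountingClassifier` class are hypothetical names invented here.

```python
from collections import Counter, defaultdict

# Hypothetical event log of (source, reading) pairs, invented for illustration.
events = [
    ("door", "open"), ("door", "closed"), ("door", "open"),
    ("thermostat", "high"), ("thermostat", "low"), ("door", "open"),
]

# Analytics: plain counting of how often each event occurred.
counts = Counter(events)


def frequencies(counter):
    """'Fancy counting': turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: n / total for key, n in counter.items()}


class CountingClassifier:
    """'Counting with feedback': per-label word counts, updated
    every time a labelled example arrives."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def learn(self, words, label):
        # Feedback step: add the words of a labelled example to that label's counts.
        self.word_counts[label].update(words)

    def predict(self, words):
        # Score each label by how often it has already seen these words.
        scores = {
            label: sum(seen[w] for w in words)
            for label, seen in self.word_counts.items()
        }
        return max(scores, key=scores.get) if scores else None


model = CountingClassifier()
model.learn(["refund", "late"], "complaint")
model.learn(["thanks", "great"], "praise")
print(counts[("door", "open")], model.predict(["late", "refund"]))
```

The trade-off is visible even at this size: the plain counts are trivially explainable, while the feedback-driven model adapts to new examples but is only as good as the labels it is fed.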
Cloudera was launched with the ambition of customers putting their data in the cloud. Now, people are more comfortable with storing data in the cloud. There will still be some data on premises. The cloud helps with agility and self-serve experiments, and helps you transform. It's also elastic: you pay only for as much as you need.
It's really happening!
Tools are being adopted across every industry, especially in healthcare! They are also being used in telecom, banking, retail, etc. There are real apps being deployed at greater and greater scale. Welcome to the data age! This is the fuel that will serve us through the century. Users drive the innovation. Together, we can make tremendous advances.
As for the future of Hadoop, it's a philosophy, a style. It's a set of open source projects that lets you innovate more rapidly. People miss how slowly enterprises move. They don't adopt technologies quickly. Adopting Hadoop is a long process. The architecture built on open source projects is taking hold. For instance, the NYSE moved to Cloudera.
We are adding more applications around the edge. We will see more around 5G, which will add to data processing. We are moving away from the traditional, more as a style than as software. It's a different market when you are developing in open source. We will continue to see new technologies being invented. We are in a golden age for the creation of data software.
We are also seeing rapid movement to the cloud. There will be hybrid and multi-clouds. There will be private clouds as well. There will be many advantages. Private clouds have moved slowly, but they will pick up.
Delivering the enterprise data cloud
Earlier, presenting the inaugural address, Vinod Ganesan, Country Manager, India and SAARC, Cloudera, dwelt on how to deliver the enterprise data cloud.
A lot had happened last year. The "Hadoop is dead" narrative kept flooding the Internet. However, we weathered all this negativity by continuing to focus on executing our vision of delivering the industry's first enterprise data cloud, which was very well received by customers and partners. This platform has four key tenets. The first is hybrid and multi-cloud, which is crucial for lending agility to your business. The second is multi-function, which allows you to onboard any type of workload, be it streaming, analytics, or machine learning at scale.
It also needs to be highly secure and governed, given that you will be moving data across hybrid clouds, and lastly, it needs to be open. When we say open, it's not just the open source that leverages innovation from the community, but also open APIs and open standards that allow you to integrate data models and make the framework highly interoperable.
Today, we are providing our customers with three unique experiences. The first is the data hub experience, a self-service experience that allows you to build your analytics applications at scale, with security and governance. The second is the data warehouse experience, a self-service experience that allows you to run advanced analytics to solve specific problems. Lastly, there is the machine learning experience, which unlocks the economies of the cloud and allows data scientists to bring the tools of their choice in a governed manner. That accelerates the devops journey of ML, thereby helping organizations fulfill their ultimate quest of being AI driven.
Hadoop is a movement towards a modern architecture for managing and analyzing data. There is a disaggregated software stack. The hierarchy of needs for a data-driven enterprise, in the AI ladder, includes AI, ML, data science, etc. Cloudera is strong in industries that have a strong data strategy.
As per a survey, 69% of executives said their organizations need a comprehensive data strategy to meet their goals. Business demands speed and agility. IT has a mandate to secure and govern data. Enterprises want to say yes to CDP (Cloudera Data Platform).
There are four fundamental values. First, the platform must support any hybrid or multi-cloud environment. Second, it needs to be multi-functional. Both are required to drive business agility. If you are making it multi-cloud and multi-functional, these environments and workloads need to be better managed.
Next, it must be secure and governed, and be open. There should be open data formats, open APIs, etc. This led to a new category - enterprise data cloud. Cloudera Data Platform gives you the capability to run a hybrid cloud. It is built on open standards.
Data hub is a cloud-native service. You can start building analytics applications at scale and in a secure manner. Data warehouse allows you to run advanced, self-service analytics when you have a targeted business problem. ML looks at how organizations can use their tools in a governed manner. You should be able to deploy models faster.
We will continue to embrace emerging technologies. There are a lot of projects driven by open source. We have integrated these technologies, and improved the agility required to deliver.