Three streams of technological development are converging to create a unique opportunity to leverage computing power and data to compete in the modern world. There is an emerging need for a multi-skilled workforce, with data at its core. In this article we review some of the defining events in the field of technology and look at what history can tell us about how to act in the present and when preparing for the future.
Technological Development and the Power of Convergence
It’s been almost 70 years since Alan Turing came up with his test to judge the intelligence of machines.
It’s been 55 years since the first database came into existence.
It’s been over 50 years since the co-founder of Intel stated that the number of transistors in an integrated circuit would double every year, which became known as Moore’s Law.
Since that time, there have been many technological advances, particularly in the fields of AI, Data Management and Computing Infrastructure. It is hard to imagine (or remember) a world before the Internet, Personal Computers, or even smartphones. Technology is rapidly changing and it can be challenging to keep pace with new developments. However, it is also good to take stock of how far we have come.
A Potted History
Blaise Pascal is credited with inventing the first calculating machine, in the 17th century, driven by the need to ease the burden of calculating tax, a burden that unfortunately remains with us today. Charles Babbage made progress in the mid-19th century with the design of the first general-purpose computer, interestingly called the Analytical Engine. How prophetic. Babbage’s concept was one of the earliest examples of an analytics ‘project’ that combined a brilliant idea with a design ahead of its time, only to be let down by the physical implementation: Babbage’s machine has never been fully built.
Modern computing received a jolt with the invention of the vacuum tube in the early 20th century, driven by the desire to amplify radio signals over long distances, and of the transistor in 1947. As early warehouse-sized mainframes gave way to minicomputers and eventually to desktop-sized devices in the 1980s, new possibilities opened up for increased productivity and the automation of manual tasks.
Around the same time, the early beginnings of the Internet were being developed in ARPANET, a US Department of Defense project consisting of a series of networked computers communicating via the TCP/IP protocol that has become the basis of Internet communications today. The first relational databases were also being made commercially available, supporting the organisation of data for analysis and reporting. It’s also around this time that the now ubiquitous spreadsheet application was developed. Charting the history of VisiCalc on the Apple II (try it at archive.org), Microsoft’s relatively unsuccessful Multiplan, and the rise and fall of Lotus 1-2-3, before all were eclipsed by Microsoft Excel, would make an interesting article in its own right.
In 1997, IBM’s Deep Blue beat the reigning world chess champion, Garry Kasparov, in a six-game match. By the end of the decade the first smartphones were being widely adopted in Japan, although interestingly the i-mode protocol they used did not receive broad adoption outside of that country.
2003 saw the publication of the Google File System research paper, outlining the technique of distributing files across multiple servers to support storage and analysis of large datasets. Although not the first to suggest this technique, it was to become the basis for similar approaches used by Yahoo to create the Hadoop Distributed File System (HDFS), widely used today (more on this later).
More recently, in March 2016 the distributed version of AlphaGo, the program developed by Google’s DeepMind to play the board game Go, beat Lee Sedol, the second highest ranking (human) Go player. To achieve the feat, DeepMind combined Google’s Cloud Compute offering (on-demand elastic compute, distributed storage and networking capacity) with Monte Carlo tree search and Deep Learning techniques. Go-winning computers don’t have an obvious use case in modern corporations, but the convergence of the underlying enabling technologies does open up new opportunities for commercial application.
However, if history has taught us one thing, it is that we should not blindly accept the fashion of the day, but make informed decisions based upon a sound understanding of the bigger picture and the reasons things are the way they are. That does not mean that there is no room for innovation – quite the contrary. Understanding mistakes and flaws in our history can only improve effective innovation.
Four Lessons from History
1. Do not get distracted by semantics or short-term ‘hype-cycles’
The term ‘Big Data’ is generally acknowledged to have first been used by John Mashey in 1998 while at Silicon Graphics. In his presentation at Usenix (originally the Unix User Group), he discussed the challenges that computing infrastructure would face in the following years to meet the growing demands caused by large volumes of data. At that time, one of the challenges he foresaw was the need to be able to store up to 40 GB of information on a disk, a volume that can now easily be stored (and lost!) on inexpensive flash drives. Doug Laney expanded this definition 3 years later in his META Group (now part of Gartner) paper, describing the key attributes of Big Data as the 3 V’s: Volume, Velocity and Variety.
Both these references were in a world prior to the development of cloud computing or distributed file systems such as Hadoop. In Mashey’s case, it was even prior to the adoption of smartphones. Neither author could have accurately predicted the growth in data driven by digital channels and online activity, but both knew that new capability would be required to handle changing future needs.
There is some current discussion in the industry about whether Big Data is dead. Regardless of whether the term is becoming unfashionable, the need to handle and interpret growing amounts of different types of data in close to real-time remains. The focus of any data strategy should be on meeting the underlying need rather than implementing a solution because it is branded as ‘Big Data’. Considering future needs, strategies should aim to answer the question: how can the organisation benefit from what is now made possible by the convergence of elastic infrastructure, distributed file systems and sophisticated algorithms?
Another example from the archives to consider: the term ‘Data Science’ was first used by Danish computer scientist Peter Naur in 1960. Using his definition, a data scientist is merely a computer scientist or statistician. Computer scientists have been around for a while now but very few people with that title have ever been allowed a direct conversation with the head of a marketing function or the CEO! Perhaps data scientist just sounds better?
2. There is never a ‘right time’ to invest
Many concepts thought to be ‘new’ have in fact been evolving over many years. The first chatbots appeared over 50 years ago, although they were not given the moniker back then. The R programming language is over 20 years old and is based on an even older language, S. Waiting for the ‘final’ version of a product or technology could result in missed opportunities. However, the key is understanding the real and current capabilities of the technology.
ELIZA was one of the first implementations of what are now called chatbots. It was a relatively simple program that responded to keywords in a typed conversation to give the impression of an intelligent interaction. Any business that had implemented ELIZA in their customer service centre in 1966 when the program was first developed at MIT would have quickly discovered the limitations and negative impact it could have on customer experience! The lesson is to be realistic about what the current technology can do and to understand what its limitations are.
(You can try an implementation of ELIZA for yourself at manifestation.com and even view the Java code that generates this ELIZA implementation by right-clicking and viewing source)
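To make the keyword approach concrete, here is a minimal sketch of an ELIZA-style responder. The rules and replies are invented for illustration and are not taken from the original MIT script; the point is only to show how little 'understanding' sits behind the apparent conversation.

```python
import re

# Hypothetical keyword rules, invented for this sketch; they are not
# taken from the original 1966 script.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(mother|father|family)\b", re.IGNORECASE),
     "Tell me more about your family."),
]
DEFAULT_REPLY = "Please go on."


def respond(message: str) -> str:
    """Return the reply for the first rule whose keyword pattern matches."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured words back inside a canned template.
            return template.format(*match.groups())
    # No keyword found: fall back to a neutral prompt.
    return DEFAULT_REPLY


print(respond("I need a refund."))
print(respond("I am unhappy with my bill"))
print(respond("Where is my order?"))
```

The second call shows the limitation: the program echoes "my bill" back without adjusting the pronoun, and the third falls through to a stock phrase because no keyword matched. A customer asking a real question gets no real answer.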
3. Understanding the history of development can highlight the strengths of a particular technology
Distributed File Systems
The Hadoop framework, and specifically the Hadoop Distributed File System (HDFS), was developed by Yahoo primarily to address the storage of unstructured data being generated by search engines. HDFS itself was developed based on the ground-breaking research at Google and their paper on the Google File System in 2003. No traditional method of storage, and certainly no relational database, could handle the sheer volume and variety of data being generated at the time. The additional effort of managing multiple server nodes and clusters (such as the need to handle outages) was offset by the benefits of having access to large, rich datasets that could be processed in a parallel fashion.
However, that does not mean that distributed file storage is the perfect method for storing and querying all forms of data. There are many organisational and operational management challenges that mean more traditional methods of data management are sometimes better suited. For example, there may be a time in the future where Finance departments are managing and reporting revenue / profit streams on a real-time basis. However, most internal and external financial reporting currently relies on reconciled metrics reported on a formal and regular basis to a published schedule. In this world, data has a definite structure and pattern.
Much of the use of Hadoop and equivalents has been driven by cost, or the belief that open source and on-demand means the solution must be cheaper. However, the required business need and the functionality of the solution must be matched for an implementation to have a total cost of ownership that is lower over the longer term.
(It may be cheaper to hire a large circular saw from a hire-shop than it is to buy a chain saw from a hardware store. However, if your job is chopping down trees, you will quickly discover the most cost-effective solution over the longer term, even though it may be possible to use either tool for the job.)
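The idea of spreading a file across several servers, and surviving the loss of one of them, can be sketched in a few lines. This is a deliberately simplified illustration: the block size, the replica count and the round-robin placement rule are assumptions made for the example and do not describe how HDFS or the Google File System actually place blocks.

```python
from typing import Dict, List

BLOCK_SIZE = 4  # bytes per block; tiny so the example stays readable


def place_blocks(data: bytes, nodes: List[str], replicas: int = 2) -> Dict[int, List[str]]:
    """Map each block index to the nodes that hold a copy of it."""
    if replicas > len(nodes):
        raise ValueError("cannot keep more replicas than there are nodes")
    block_count = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    placement: Dict[int, List[str]] = {}
    for index in range(block_count):
        # Simplified round-robin rule: replicas go to consecutive nodes.
        placement[index] = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_after_outage(placement: Dict[int, List[str]], failed: str) -> bool:
    """A file stays readable if every block has a copy on a surviving node."""
    return all(
        any(node != failed for node in holders)
        for holders in placement.values()
    )


nodes = ["node-a", "node-b", "node-c"]  # hypothetical server names
layout = place_blocks(b"0123456789", nodes)
print(layout)
print(readable_after_outage(layout, "node-b"))
```

With two copies of each block the file survives the loss of any single node; with `replicas=1` the same outage makes one block unreadable. That extra bookkeeping is the management overhead the paragraph above refers to, and the independent blocks are what allow the data to be processed in parallel.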
The original Data Lake concept was about storage of raw data from a single source to support any type of analysis. James Dixon, CTO of Pentaho, first publicly described the concept on his blog in 2010 and later in a 2011 Forbes article. At that time the concept was to store data in a massive, easily accessible repository that was only organised when questions needed answers. According to the article, it was aimed at addressing some of the issues with OLAP (or Online Analytical Processing) and Business Intelligence tools that required data to be stored in a pre-structured way before it could be queried, typically in a data mart, limiting the type of analysis that could be performed.
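The difference between organising data up front and organising it only when a question is asked can be illustrated with a small sketch. The event records and field names below are invented for the example; the function simply keeps the raw records untouched and applies structure at query time.

```python
import json
from typing import Dict, List

# Raw events kept exactly as they arrived; the field names are invented
# for this sketch.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/pricing", "ms": 340, "referrer": "search"}',
    '{"user": "u1", "page": "/pricing", "ms": 95}',
]


def query(raw_lines: List[str], field: str) -> Dict[str, int]:
    """Count records per value of `field`, applying structure only at read time."""
    counts: Dict[str, int] = {}
    for line in raw_lines:
        record = json.loads(line)  # structure is imposed here, not at load
        key = str(record.get(field, "(missing)"))
        counts[key] = counts.get(key, 0) + 1
    return counts


print(query(RAW_EVENTS, "page"))
print(query(RAW_EVENTS, "referrer"))
```

The second query asks about a field nobody planned for when the data was stored. A pre-structured data mart would have needed a schema change to answer it, whereas here the raw records already contain whatever they contain. The trade-off is that every query has to cope with missing or inconsistent fields itself.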
In his updated blog in 2014, Dixon noted that the Data Lake does not (yet) replace the Data Warehouse, and often companies have only one source of data that qualifies for storage in a Lake. To quote “.. a Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake”, with a water garden being a series of connected ‘data pools’.
Is it possible that in more recent years the concept has been misused by vendors and service providers to sell more products and services? There is no doubt that the concept has evolved since the original definition of the term, supported by advances in the technology, and that the ability to parallel process large data sets is driving new possibilities. However, some of the challenges with using a Data Lake as the enterprise repository for Management Reporting (such as ownership, and agreed and consistent definitions of metrics) should be understood before decommissioning classical data warehouses.
[Dixon’s / Pentaho’s series of YouTube videos makes for fascinating viewing for anyone looking to get a solid understanding of the development history of Data Lakes and Hadoop. To paraphrase Dixon: “Hadoop was developed to allow Yahoo to index the internet. How does this compare to Business Intelligence? In BI, no-one is looking to index the internet”]
4. What was once the domain of specialists becomes a common skill, and new skill requirements emerge
In the not too distant past, only mathematicians had heard of the R programming language, and MATLAB was taught (and quickly forgotten in my own case) only in universities. Today, R is commonplace in analytics teams and MATLAB is widely used in investment teams. What were once largely academic tools are now used for better informed commercial decisions.
However, placing an academic mathematician in a corporate environment often does not result in an optimal outcome. There are many soft skills required, not least of which is the acceptance of an imperfect solution, or understanding of corporate politics. Furthermore, there is a need to understand the commercial drivers of a business to understand how best to apply the science.
The convergence of the technology is driving the need for multi-skilled people who can grasp advanced analytical concepts, appreciate the need for robust data sets, and leverage modern computing infrastructure to address real business challenges. That does not mean that there is no longer a need for specialists in each field, but to quote a recent McKinsey research article, there is a need for a “translator….who not only understands the data science but also how it can be applied to the business”.
We are at an exciting moment. The nexus of technology trends is providing new opportunities for commercial applications and more sophisticated business models. However, as the saying commonly attributed to the 18th-century statesman and philosopher Edmund Burke goes, “those who don’t know history are destined to repeat it”. Forming a plan for the future requires a firm understanding of historical developments, and converged skillsets are needed to take advantage of the opportunities presented.