Interested in taking your company on a journey into cloud computing, big data, and machine learning? You can read more about it and the Google Cloud Platform below.
Having had the opportunity to study towards Google Professional Cloud Architect and Google Professional Data Engineer and getting involved with Google Cloud Platform at client site I could not help but be impressed with the capabilities (data warehousing, big data, data pipelines, and machine learning) available on the platform. I’ll explain…
Businesses have a requirement to analyse data for effective decision making. A big proportion of that data is ‘big data’ collected via the company website and social media platforms. Now because there is a lot of it (data) and its nature tends to be unstructured or semi-structured data the obvious candidate is Apache Hadoop. Why Hadoop?
Hadoop is designed to handle high volumes of unstructured and semi-structured data so it is going to meet big data needs. However, the idea behind Hadoop was not to build a supercomputer for the job but to split (shard) and distribute that data across several computers in a cluster.
‘“In pioneer days they used oxen for heavy pulling and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers” – Grace Hopper’ (Tom White, 2015, Hadoop The Definitive Guide – Storage & Analysis at Internet Scale, O’Reilly, 4th Edition, p3).
Starting my personal venture into Hadoop my first port of call on the Hadoop journey was the Apache Hadoop website where I downloaded the Hadoop software. However, I quickly realised that the set up was not going to be straight forward and that I would need to acquire additional coding and tuning skills; something that I had neither the time nor appetite for. Also, I was going to need to buy, configure, and importantly maintain new computers and code to make Hadoop effective. This would amount to extra costs in terms of hardware and labour. So my second port of call was to take a free and pre-configured version of Hadoop in the form of a sandbox from Hortonworks and run it on a VM to gain more familiarity. This was partially successful but I still ran into technical problems and of course all I had was a sandbox to experiment with, not something that I could use for processing high volumes of live data.
Third port of call was the cloud and I considered the big three, Amazon AWS, Microsoft Azure, and Google Cloud Platform. I was initially pulled towards Azure because it’s closely tied to MS SQLServer, a RDBMS that I’d been involved with for the best part of twenty years. However, it was when I started using Google Cloud Platform that my progress and ideas with Hadoop started to accelerate.
So, does Google Cloud Platform include Hadoop? Oh yes, and what may not be understood is that Google have been in the game from early on. Hadoop makes use of a file system storage capability known as HDFS (Hadoop Distributed Filesystem). Distributed file systems can be traced back to a paper published by Google in 2003. Then in 2004 Google published a paper introducing MapReduce. Is MapReduce important? Yes, because if you are going to shard your data across numerous computers you are going to want to join it back together again to be consumed, and that is the function of MapReduce.
If you log on to Google Cloud Platform you will not see a button to start Hadoop. But you will see a managed service called Dataproc. When you configure and initiate Dataproc you are starting a cluster that includes Hadoop, Spark, Hive and Pig (members of the Hadoop ecosystem). For me this is a big win:
- I do not need to configure and manage the Hadoop ecosystem – Google does that for me – I can focus on my objectives of analytics and Machine Learning
- I do not need to buy additional hardware, I effectively rent the hardware from Google
- I do not need to maintain (or pay someone to maintain for me) big data software or the hardware it’s installed on
- I only pay for what I use. If my Hadoop/Spark job takes three hours to run then I rent the VMs for three hours. I don’t pay for them when I’m not using them
- Without becoming too technical if I run Hadoop with HDFS I need to have the machines on all the time if I need HDFS to persist my data. On Google Cloud Platform I can persist my data in Google Cloud Storage and power down the VMs when they have finished processing so I don’t pay for their use
- Google seem to have done a great job of integrating the different components of the platform together so it’s not a problem to move data out of Dataproc to BigQuery (the SQL based Google query engine) or into Tensorflow (ML) or Google DataStudio for analytics
As I journeyed along with Dataproc I began to discover many additional benefits to those above. There are platform components that cover Internet of Things (IoT – machine sensor data), Pub/Sub for message streaming (including microservices), Dataflow for streaming and batch processing and loading data, Bigtable (NoSQL) for real time data and Cloud Spanner which is horizontally scalable for very large data warehousing deployments. And of course you can undertake data science. A very good read on data pipelines and machine learning is Valliappa Lakshmana, 2018, Data Science on the Google Cloud Platform – Implementing Real-Time Data Pipelines: From Ingest To Machine Learning, O’Reilly.
So, is there a downside to GCP? You may feel that security is an issue. GCP has a strong security component known as IAM (Identity and Access Management) which to me seems at least as good as that found in SAP BusinessObjects which has stood the test of time. You can also encrypt your data on premise and then move it to the cloud for processing. But you are trusting your data to a Google owned and managed data centre. In some ways that’s no different to a lot of large companies that host their data in data centres that they do not own. But if you feel uncomfortable you can also take a hybrid approach of cloud/on prem. In the hybrid model some of your data will be hosted on the cloud and other (perhaps sensitive) data can be hosted on your premises.
Also, someone at some point, will need to code python or java for something like Dataflow (Apache Beam), Spark or Tensorflow (ML). The coding gives a lot of flexibility but in some ways it feels cumbersome compared to some of the modern ETL tools like SAP Data Services, Informatica, and Microsoft’s SSIS. With those ETL tools you drag and drop components to create your data pipeline rather than having to develop and maintain reams of code. Out of the box functionality reduces the risk of human error and supporting esoteric code.
You can move all your data warehousing capabilities to Google Cloud Platform and Cloud SQL provides OLTP capabilities through RDBMS but keep in mind that many of the challenges of data do not go away just because a bunch of new technologies capable of handling vast amounts of data are in play. I’m referring to solution and data architecture, data quality, data profiling, data cleansing, data transformations, data modeling, transaction updates and tracking history. BI System Builders have an end to end service framework for data management solutions known as Cornerstone Solution®. Cornerstone Solution® addresses all the above challenges. It also fits Google Cloud Platform very well and is currently being redesigned to accommodate data lakes and big data concepts.
Get in touch if you’d like to engage us to help and have a successful journey!