Posted on

Cloud Computing & Big Data

Google Cloud Platform

If you are considering a journey on to Google Cloud Platform, BI System Builders would love you to engage us to help you, so please get in contact. We’re on a similar journey ourselves and you can read about it below.

BI System Builders’ data, or at least a chunk of it, is migrating to Google Cloud Platform (GCP). Having had the opportunity to study towards achieving Google Professional Data Engineer and getting involved with Google Cloud Platform at client site I could not help but be impressed with the capabilities (data warehousing, big data, data pipelines, and machine learning) available on the platform. I’ll explain…

Like many businesses BI System Builders has a requirement to analyse its data for effective decision making. A big proportion of that data is collected via the company website and social media platforms. Now because there is a lot of it (data) and its nature tends to be unstructured or semi-structured data the obvious candidate is Apache Hadoop. Why Hadoop?

Hadoop is designed to handle high volumes of unstructured and semi-structured data so it was going to meet my needs. However, the idea behind Hadoop was not to build a supercomputer for the job but to split (shard) and distribute that data across several computers in a cluster.

‘“In pioneer days they used oxen for heavy pulling and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers” – Grace Hopper’ (Tom White, 2015, Hadoop The Definitive Guide – Storage & Analysis at Internet Scale, O’Reilly, 4th Edition, p3).

My first port of call on the Hadoop journey was the Apache Hadoop website where I downloaded the Hadoop software. I quickly realised that the set up was not going to be straight forward and that I would need to acquire additional coding and tuning skills; something that I had neither the time nor appetite for. Also, I was going to need to buy, configure, and importantly maintain the new computers and code. This amounted to extra costs in terms of hardware and labour. My second port of call was to take a free and pre-configured version of Hadoop in the form of a sandbox from Hortonworks and run it on a VM to gain more familiarity. This was partially successful but I still ran into technical problems and of course all I had was a sandbox to experiment with, not something that I needed for the company’s live data.

Third port of call was the cloud and I considered the big three, Amazon AWS, Microsoft Azure, and Google Cloud Platform. I was initially pulled towards Azure because it’s closely tied to MS SQLServer, a RDBMS that I’d been involved with for the best part of twenty years. However, it was when I started using Google Cloud Platform that my progress and ideas started to accelerate.

So, does GCP include Hadoop? Oh yes, and what may not be understood is that Google have been in the game from early on. Hadoop makes use of a file system storage capability known as HDFS (Hadoop Distributed Filesystem). Distributed file systems can be traced back to a paper published by Google in 2003. Then in 2004 Google published a paper introducing MapReduce. Is MapReduce important? Yes, because if you are going to shard your data across numerous computers you are going to want to join it back together again to be consumed, and that is the function of MapReduce.

If you log on to Google Cloud Platform you will not see a button to start Hadoop. But you will see a managed service called Dataproc. When you configure and initiate Dataproc you are starting a cluster that includes Hadoop, Spark, Hive and Pig (members of the Hadoop ecosystem). For me this is a big win:

  1. I do not need to configure and manage the Hadoop ecosystem – Google does that for me – I can focus on my objectives of analytics and Machine Learning
  2. I do not need to buy additional hardware, I effectively rent the hardware from Google
  3. I do not need to maintain (or pay someone to maintain for me) big data software or the hardware it’s installed on
  4. I only pay for what I use. If my Hadoop/Spark job takes three hours to run then I rent the VMs for three hours. I don’t pay for them when I’m not using them
  5. Without becoming too technical if I run Hadoop with HDFS I need to have the machines on all the time if I need HDFS to persist my data. On GCP I can persist my data in Google Cloud Storage and power down the VMs when they have finished processing so I don’t pay for their use
  6. Google seem to have done a great job of integrating the different components of the platform together so it’s not a problem to move data out of Dataproc to BigQuery (the SQL based Google query engine) or into Tensorflow (ML) or Google DataStudio for analytics

As well as the benefits above there are platform components that cover IOT (machine sensor data), Pub Sub for message streaming including microservices, Dataflow for streaming and batch processing and loading data, Bigtable (NoSQL) for real time data and Cloud Spanner which is horizontally scalable for very large data warehousing deployments, and of course undertaking data science. A very good read on data pipelines and machine learning  is Valliappa Lakshmana, 2018, Data Science on the Google Cloud Platform – Implementing Real-Time Data Pipelines: From Ingest To Machine Learning, O’Reilly.

So, is there a downside to GCP? You may feel that security is an issue. GCP has a strong security component known as IAM (Identity and Access Management) which to me seems at least as good as that found in SAP BusinessObjects which has stood the test of time. But you are trusting your data to a Google owned and managed data centre. In some ways that’s no different to a lot of large companies that host their data in data centres that they do not own. But if you feel uncomfortable you can also take a hybrid approach of cloud/on prem. In the hybrid model some of your data will be hosted on the cloud and other (perhaps sensitive) data can be hosted on your premises.

Also, someone at some point, will need to code python or java for something like Dataflow (Apache Beam), Spark or Tensorflow (ML). The coding gives a lot of flexibility but in some ways this coding feels like a step backwards from some of the modern ETL tools like SAP Data Services, Informatica, and Microsoft’s SSIS. With those ETL tools you drag and drop components to create your data pipeline rather than having to develop and maintain reams of code. Out of the box functionality reduces the risk of human error and supporting esoteric code.

You can move all your data warehousing capabilities to GCP but keep in mind that many of the challenges of data do not go away just because a bunch of new technologies capable of handling vast amounts of data are in play. I’m referring to solution and data architecture, data quality, data profiling, data cleansing, data transformations, data modelling, transaction updates and tracking history. BI System Builders have an end to end service framework for data management solutions known as Cornerstone Solution®. Cornerstone Solution® addresses all the above challenges. It also fits GCP very well and is currently being redesigned to accommodate data lakes and big data concepts. Have a successful journey!