Cloud Computing & Big Data

Google Cloud Platform

Interested in taking your company on a journey into cloud computing, big data, and machine learning? You can read more about it and the Google Cloud Platform below.

Having had the opportunity to study towards the Google Professional Cloud Architect and Google Professional Data Engineer certifications, and to work with Google Cloud Platform at a client site, I could not help but be impressed by the capabilities available on the platform: data warehousing, big data, data pipelines, and machine learning. I’ll explain…

Businesses need to analyse data for effective decision making. A large proportion of that data is ‘big data’ collected via company websites and social media platforms. Because there is a lot of it, and because it tends to be unstructured or semi-structured, the obvious candidate for processing it is Apache Hadoop. Why Hadoop?

Hadoop is designed to handle high volumes of unstructured and semi-structured data, so it is well suited to big data needs. However, the idea behind Hadoop was not to build a supercomputer for the job, but to split (shard) the data and distribute it across several computers in a cluster.

“In pioneer days they used oxen for heavy pulling and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers” – Grace Hopper (quoted in Tom White, 2015, Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th Edition, O’Reilly, p. 3).

My first port of call on my personal Hadoop journey was the Apache Hadoop website, where I downloaded the Hadoop software. I quickly realised that the setup was not going to be straightforward and that I would need to acquire additional coding and tuning skills; something I had neither the time nor the appetite for. I would also need to buy, configure, and, importantly, maintain new computers to make Hadoop effective, which would mean extra costs in hardware and labour. My second port of call was therefore a free, pre-configured version of Hadoop in the form of the Hortonworks sandbox, which I ran on a VM to gain more familiarity. This was partially successful, but I still ran into technical problems, and of course all I had was a sandbox to experiment with, not something I could use to process high volumes of live data.

My third port of call was the cloud, and I considered the big three: Amazon AWS, Microsoft Azure, and Google Cloud Platform. I was initially drawn towards Azure because it is closely tied to Microsoft SQL Server, an RDBMS I had worked with for the best part of twenty years. However, it was when I started using Google Cloud Platform that my progress and ideas with Hadoop began to accelerate.

So, does Google Cloud Platform include Hadoop? Oh yes, and what may not be widely understood is that Google have been in the game from early on. Hadoop uses a storage layer known as HDFS (Hadoop Distributed File System), whose design traces back to the Google File System paper published by Google in 2003. Then, in 2004, Google published the paper that introduced MapReduce. Is MapReduce important? Yes, because if you shard your data across numerous computers, you need a way to process the pieces in parallel and join the results back together for consumption; that is the job of MapReduce.
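To make the idea concrete, here is a toy word-count sketch of the map, shuffle, and reduce phases in plain Python. It is purely conceptual – the real Hadoop API and its distribution across machines look quite different – but it shows how work done independently on each shard is joined back together by key.

```python
from collections import defaultdict

# Conceptual sketch of MapReduce on sharded data (illustrative only,
# not the Hadoop API): each shard is mapped independently, then the
# intermediate results are shuffled by key and reduced back together.

shards = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(shard):
    # Emit (word, 1) pairs for one shard; in Hadoop this would run
    # independently on each machine in the cluster.
    return [(word, 1) for word in shard.split()]

def shuffle(mapped):
    # Group intermediate pairs by key across all shards.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the values for each key into a final result.
    return {key: sum(values) for key, values in grouped.items()}

word_counts = reduce_phase(shuffle(map(map_phase, shards)))
print(word_counts["the"])  # 3
```

The important point is the shape of the computation: the map phase never needs to see more than one shard, which is what lets the work spread across many modest machines rather than one large one.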

If you log on to Google Cloud Platform you will not see a button to start Hadoop, but you will see a managed service called Dataproc. When you configure and initiate Dataproc you start a cluster that includes Hadoop, Spark, Hive, and Pig (members of the Hadoop ecosystem). For me this is a big win:

  1. I do not need to configure and manage the Hadoop ecosystem – Google does that for me – so I can focus on my objectives of analytics and machine learning
  2. I do not need to buy additional hardware; I effectively rent it from Google
  3. I do not need to maintain (or pay someone to maintain for me) the big data software or the hardware it is installed on
  4. I only pay for what I use. If my Hadoop/Spark job takes three hours to run, I rent the VMs for three hours; I do not pay for them when I am not using them
  5. Without becoming too technical: if I run Hadoop with HDFS, the machines must stay on all the time for HDFS to persist my data. On Google Cloud Platform I can persist my data in Google Cloud Storage and power down the VMs when they have finished processing, so I do not pay for them while they are idle
  6. Google seem to have done a great job of integrating the different components of the platform, so it is not a problem to move data out of Dataproc into BigQuery (Google’s SQL-based query engine), TensorFlow (machine learning), or Google Data Studio (analytics)
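Points 4 and 5 are worth a back-of-envelope calculation. The figures below are hypothetical, not real GCP prices, but they show why renting VMs only for the duration of a job is so different from keeping an HDFS cluster powered on around the clock to persist data:

```python
# Illustrative cost arithmetic for pay-per-use versus always-on.
# The hourly rate is hypothetical, not a real GCP price.

HOURLY_RATE_PER_VM = 0.10  # hypothetical cost per VM-hour in USD

def job_cost(num_vms, hours):
    # Rent the VMs only while the job runs.
    return num_vms * hours * HOURLY_RATE_PER_VM

def always_on_cost(num_vms, hours_in_month=730):
    # Keep an HDFS cluster powered on all month to persist its data.
    return num_vms * hours_in_month * HOURLY_RATE_PER_VM

# A three-hour Hadoop/Spark job on a ten-VM cluster:
print(round(job_cost(10, 3), 2))     # 3.0
print(round(always_on_cost(10), 2))  # 730.0
```

Whatever the real unit price, the ratio between the two figures is the point: paying for three VM-hours per machine rather than seven hundred and thirty.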

As I journeyed along with Dataproc I discovered many additional benefits beyond those above. There are platform components covering the Internet of Things (IoT – machine sensor data), Pub/Sub for message streaming (including between microservices), Dataflow for streaming and batch processing and loading of data, Bigtable (NoSQL) for real-time data, and Cloud Spanner, which scales horizontally for very large data warehousing deployments. And of course you can undertake data science. A very good read on data pipelines and machine learning is Valliappa Lakshmanan, 2018, Data Science on the Google Cloud Platform – Implementing Real-Time Data Pipelines: From Ingest to Machine Learning, O’Reilly.
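The value of something like Pub/Sub is the pattern rather than the product: producers publish messages to a topic and consumers read them asynchronously, so neither side blocks on the other. A toy version of that decoupling can be sketched with a thread and a queue in plain Python (conceptual only; the real Cloud Pub/Sub API differs):

```python
import queue
import threading

# A toy publish/subscribe pattern: a publisher pushes messages onto a
# topic and a subscriber thread consumes them asynchronously, so the
# producer never waits for the consumer (conceptual sketch only; the
# real Cloud Pub/Sub client API differs).

topic = queue.Queue()
received = []

def subscriber():
    while True:
        message = topic.get()
        if message is None:  # sentinel value used to stop the worker
            break
        received.append(message.upper())  # stand-in for real processing

worker = threading.Thread(target=subscriber)
worker.start()

# Publish some machine-sensor-style events, then signal shutdown.
for event in ["sensor-1: 21C", "sensor-2: 19C"]:
    topic.put(event)
topic.put(None)
worker.join()

print(received)  # ['SENSOR-1: 21C', 'SENSOR-2: 19C']
```

In a real deployment the topic lives in the cloud, the publisher might be an IoT device, and the subscriber might be a Dataflow job; the decoupling is the same.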

So, is there a downside to GCP? You may feel that security is an issue. GCP has a strong security component known as IAM (Identity and Access Management), which to me seems at least as good as that found in SAP BusinessObjects, a product that has stood the test of time. You can also encrypt your data on premises before moving it to the cloud for processing. You are, however, trusting your data to a Google-owned and managed data centre. In some ways that is no different from the many large companies that host their data in data centres they do not own. But if you feel uncomfortable you can take a hybrid cloud/on-premises approach, in which some of your data is hosted in the cloud and other (perhaps sensitive) data is hosted on your premises.

Also, someone, at some point, will need to write Python or Java code for something like Dataflow (Apache Beam), Spark, or TensorFlow (machine learning). The coding gives a lot of flexibility, but in some ways it feels cumbersome compared with modern ETL tools like SAP Data Services, Informatica, and Microsoft’s SSIS. With those ETL tools you drag and drop components to create your data pipeline rather than developing and maintaining reams of code. Out-of-the-box functionality reduces the risk of human error and the burden of supporting esoteric code.
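To give a feel for what that code-defined style looks like, here is a minimal pure-Python analogue of a pipeline: a chain of small transforms (clean, parse, aggregate) of the kind a Dataflow job would express in the Apache Beam API, or an ETL tool would express as drag-and-drop components. The data and transform names are invented for illustration.

```python
# A minimal pure-Python analogue of a code-defined data pipeline
# (illustrative only; the real Apache Beam API differs). Each stage is
# a small transform, and the pipeline is the chain of stages.

raw_rows = ["  Alice,200 ", "Bob,abc", "Carol,350", ""]

def clean(rows):
    # Strip whitespace and drop empty rows.
    return (row.strip() for row in rows if row.strip())

def parse(rows):
    # Split each row into (name, amount); discard rows that fail
    # validation, such as non-numeric amounts.
    for row in rows:
        name, _, amount = row.partition(",")
        if amount.isdigit():
            yield name, int(amount)

def total(rows):
    # Aggregate the amounts into a single figure.
    return sum(amount for _, amount in rows)

result = total(parse(clean(raw_rows)))
print(result)  # 550
```

Every stage here is a few lines, but a production pipeline accumulates dozens of such stages plus error handling and monitoring; that accumulation is the “reams of code” that graphical ETL tools package up for you.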

You can move all your data warehousing capabilities to Google Cloud Platform, and Cloud SQL provides OLTP capabilities through an RDBMS. Keep in mind, though, that many of the challenges of data do not go away just because a set of new technologies capable of handling vast amounts of data is in play. I am referring to solution and data architecture, data quality, data profiling, data cleansing, data transformations, data modeling, transaction updates, and tracking history. BI System Builders have an end-to-end service framework for data management solutions known as Cornerstone Solution®, which addresses all of the above challenges. It also fits Google Cloud Platform very well and is currently being redesigned to accommodate data lakes and big data concepts.

Get in touch if you’d like to engage us to help and have a successful journey!

Business Intelligence & Data Analytics

Business Intelligence (BI) is the process of taking your raw data and transforming it into usable information. The final outcome is usually a set of data analytics and dashboards that help business experts make informed decisions.

Big data and machine learning technologies and capabilities are very important, but the analysis of high volumes of unstructured data does not replace the ‘business as usual’ (BAU) requirements that most businesses satisfy through business intelligence. Neither does it replace the need for financial and legislative reporting. The reality is that organisations need both. Consequently, the concept of the ‘(data) warehouse by the (data) lake’ is real and is accepted by forward-thinking big data vendors like Hortonworks and thought leaders like Dr. Barry Devlin1.

BI System Builders have assisted businesses in the business intelligence and data warehousing space for the last ten years. Contact us if you need guidance through the maze of data technologies and capabilities, or if you want help to implement a data warehouse, a business intelligence system, or a set of data analytics. You can read some of our articles below for more information.

Basic BI Terminology

The End To End BI Concept

End To End BI 1

End To End BI 2

Business Intelligence

Avoid BI Breakpoints With Cornerstone Solution®

Business Intelligence Architecture

BI Solutions

Data Analytics

Data Warehousing and Data Pipelines

Cornerstone Solution® 101

Business Intelligence Services

1 Dr. Barry Devlin, April 2017, The EDW Lives On – The Beating Heart of the Data Lake, White Paper sponsored by Hortonworks

Collaborative Data Modeling

Get in touch to discuss your data modeling requirements; we have very high levels of expertise and don’t like to be beaten on price. You can read more on our methods below.

Our approach is called collaborative data modeling; it brings data modelers together with business subject matter experts to collaboratively design data models that work. BI System Builders uses a structured approach known as Business Event Analysis and Modeling (BEAM) to achieve this. We’ve been associated with the author Lawrence Corr for several years, and you can read about the BEAM method developed by Lawrence in his book Agile Data Warehouse Design (ADWD). You can also read our book review of ADWD here.

We use the BEAM techniques in collaborative workshops to identify the data stories within a business. This is achieved through lots of idea generation, whiteboarding and scribing; a technique known as modelstorming. Modelstorming is similar in concept to brainstorming but the effort is aimed at developing data models.

We’ve been running these collaborative workshops for several years at companies including Volkswagen Group, Dixons Retail, NFU Mutual, Vision Express, and Interflora.

The data models we develop are known as dimensional models and are based on Ralph Kimball’s concepts. Our lead data modeler is Russell Beech, who has twenty years of dimensional modeling experience across numerous industries and took training directly from Ralph.
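For readers new to the term, a dimensional model pairs a fact table (the measurable events, such as sales) with dimension tables (the descriptive context, such as products). The sketch below builds a tiny Kimball-style star schema in SQLite; the table and row contents are invented purely for illustration.

```python
import sqlite3

# A minimal star schema to illustrate a Kimball-style dimensional
# model: one fact table of sales events joined to a product dimension.
# All names and data are invented for illustration.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_date   TEXT,
        quantity    INTEGER,
        revenue     REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, "2024-01-05", 3, 30.0),
                  (2, "2024-01-05", 1, 25.0),
                  (1, "2024-01-06", 2, 20.0)])

# A typical dimensional query: total revenue by product.
rows = conn.execute("""
    SELECT d.product_name, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.product_name ORDER BY d.product_name
""").fetchall()
print(rows)  # [('Gadget', 25.0), ('Widget', 50.0)]
```

The simplicity of that final query is the point of the design: business questions become a join from the fact table to one or more dimensions, which is what makes the resulting reports easy to build and support.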

There are three ways that we can help you:

  1. You’ve already tried modelstorming and dimensional modeling yourself, but things are not progressing as you’d hoped. Don’t worry: we can come on site and run some workshops with you to get you started, passing on our knowledge and experience as we go.
  2. You already have reports and analytics, but you feel they are overly complex and difficult to customise and support, and when business users come with new requirements, developing the new reports can be very challenging. The issue here is frequently the underlying data model. We offer a service in which we analyse and assess your existing data models against best practice designs and make recommendations. If the recommendation is a new design, we can help you with it.
  3. You don’t have any data models yet, but you would like to. Perhaps your current reporting is based on spreadsheets and you now need more flexibility. You want to start on the right foot with best practice data models that meet your reporting and analytical needs. We can come on site, take you through collaborative data modeling workshops, and design your data models for you.

Tel: +44 (0) 1926 623111 or leave your details here.