Cloud Computing & Big Data

Google Cloud Platform

Interested in taking your company on a journey into cloud computing, big data, and machine learning? You can read more about that journey, and about the Google Cloud Platform, below.

Having had the opportunity to study towards the Google Professional Cloud Architect and Google Professional Data Engineer certifications, and to get involved with Google Cloud Platform at client sites, I could not help but be impressed by the capabilities available on the platform: data warehousing, big data, data pipelines, and machine learning. I’ll explain…

Businesses need to analyse data for effective decision making. A big proportion of that data is ‘big data’ collected via the company website and social media platforms. Because there is a lot of it, and because it tends to be unstructured or semi-structured, an obvious candidate for processing it is Apache Hadoop. Why Hadoop?

Hadoop is designed to handle high volumes of unstructured and semi-structured data, so it is a good fit for big data needs. However, the idea behind Hadoop was not to build a supercomputer for the job, but to split (shard) the data and distribute it across many computers in a cluster.

“In pioneer days they used oxen for heavy pulling and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers” – Grace Hopper, quoted in Tom White, Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th edition, O’Reilly, 2015, p. 3.

When I started my personal venture into Hadoop, my first port of call was the Apache Hadoop website, where I downloaded the Hadoop software. However, I quickly realised that the setup was not going to be straightforward and that I would need to acquire additional coding and tuning skills; something I had neither the time nor the appetite for. I was also going to need to buy, configure, and, importantly, maintain new computers to make Hadoop effective, which would mean extra costs in hardware and labour. So my second port of call was a free, pre-configured version of Hadoop in the form of a sandbox from Hortonworks, which I ran on a VM to gain more familiarity. This was partially successful, but I still ran into technical problems, and of course all I had was a sandbox to experiment with, not something I could use for processing high volumes of live data.

My third port of call was the cloud, and I considered the big three: Amazon Web Services, Microsoft Azure, and Google Cloud Platform. I was initially pulled towards Azure because it is closely tied to Microsoft SQL Server, an RDBMS I had worked with for the best part of twenty years. However, it was when I started using Google Cloud Platform that my progress and ideas with Hadoop began to accelerate.

So, does Google Cloud Platform include Hadoop? Oh yes, and what may not be widely understood is that Google have been in the game from early on. Hadoop makes use of a storage layer known as HDFS (Hadoop Distributed Filesystem), whose design traces back to the Google File System paper that Google published in 2003. Then in 2004 Google published the paper introducing MapReduce. Is MapReduce important? Yes, because if you are going to shard your data across numerous computers, you are going to want to process those shards in parallel and join the results back together to be consumed, and that is the function of MapReduce.
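
To make the map and reduce steps concrete, here is a minimal word count in plain Python (the classic MapReduce illustration), with a couple of in-memory ‘shards’ of invented text standing in for data distributed across cluster nodes:

    from collections import defaultdict

    # Two "shards" of text, standing in for data split across cluster nodes.
    shards = [
        ["the quick brown fox", "the lazy dog"],
        ["the quick blue hare"],
    ]

    # Map phase: each node independently emits (word, 1) pairs for its shard.
    mapped = [(word, 1) for shard in shards for line in shard for word in line.split()]

    # Shuffle and reduce phase: group the pairs by key and sum the counts,
    # joining the sharded results back into a single answer.
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(dict(counts))  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}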

If you log on to Google Cloud Platform you will not see a button to start Hadoop, but you will see a managed service called Dataproc. When you configure and launch Dataproc you are starting a cluster that includes Hadoop, Spark, Hive, and Pig (members of the Hadoop ecosystem). For me this is a big win (there is a short PySpark sketch after the list):

  1. I do not need to configure and manage the Hadoop ecosystem – Google does that for me – so I can focus on my objectives of analytics and machine learning
  2. I do not need to buy additional hardware; I effectively rent the hardware from Google
  3. I do not need to maintain (or pay someone to maintain for me) big data software or the hardware it’s installed on
  4. I only pay for what I use. If my Hadoop/Spark job takes three hours to run, I rent the VMs for three hours; I don’t pay for them when I’m not using them
  5. Without getting too technical: if I run Hadoop with HDFS, the machines must stay on all the time for HDFS to persist my data. On Google Cloud Platform I can persist my data in Google Cloud Storage and power down the VMs when they have finished processing, so I don’t pay for them when idle
  6. Google seem to have done a great job of integrating the different components of the platform, so it’s not a problem to move data out of Dataproc into BigQuery (Google’s SQL-based query engine), into TensorFlow for machine learning, or into Google Data Studio for analytics
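
To give a flavour of point 5 in practice, here is a minimal PySpark word-count sketch of the kind you could submit to a Dataproc cluster. The cluster name, region, and bucket paths are hypothetical placeholders; reading from and writing to Cloud Storage rather than HDFS is what lets you power the cluster down when the job finishes:

    # wordcount.py - a minimal PySpark job for Dataproc; submit with, for example:
    #   gcloud dataproc jobs submit pyspark wordcount.py --cluster=my-cluster --region=europe-west2
    # (the cluster name, region, and bucket paths are hypothetical placeholders)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Read from Cloud Storage rather than HDFS, so the data outlives the cluster.
    lines = spark.read.text("gs://my-example-bucket/input/*.txt").rdd.map(lambda r: r[0])

    counts = (
        lines.flatMap(lambda line: line.split())  # map: split lines into words
        .map(lambda word: (word, 1))              # map: emit (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)          # reduce: sum the counts per word
    )

    counts.saveAsTextFile("gs://my-example-bucket/output/")
    spark.stop()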

As I journeyed along with Dataproc I began to discover many benefits in addition to those above. There are platform components that cover the Internet of Things (IoT – machine sensor data), Pub/Sub for message streaming (including between microservices), Dataflow for streaming and batch processing and loading of data, Bigtable (NoSQL) for real-time data, and Cloud Spanner, a horizontally scalable relational database for very large deployments. And of course you can undertake data science. A very good read on data pipelines and machine learning is Valliappa Lakshmanan, Data Science on the Google Cloud Platform – Implementing Real-Time Data Pipelines: From Ingest to Machine Learning, O’Reilly, 2018.
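
As a small taste of the platform’s client libraries, here is a minimal sketch of publishing an IoT-style sensor reading to Pub/Sub with the Python client. The project and topic names are hypothetical, and in a real pipeline something like Dataflow would typically sit on the subscriber side:

    # A minimal publish sketch using the google-cloud-pubsub client library.
    # The project and topic names are hypothetical placeholders.
    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-example-project", "sensor-readings")

    reading = {"device_id": "sensor-42", "temperature_c": 21.5}

    # Message payloads are raw bytes; extra metadata travels as attributes.
    future = publisher.publish(
        topic_path,
        data=json.dumps(reading).encode("utf-8"),
        source="factory-floor",
    )
    print("Published message", future.result())  # result() returns the message ID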

So, is there a downside to GCP? You may feel that security is an issue. GCP has a strong security component known as IAM (Identity and Access Management), which to me seems at least as good as that found in SAP BusinessObjects, a product that has stood the test of time. You can also encrypt your data on premises and then move it to the cloud for processing. But you are trusting your data to a Google-owned and Google-managed data centre. In some ways that’s no different from the many large companies that host their data in data centres they do not own. But if you feel uncomfortable, you can take a hybrid cloud/on-premises approach, in which some of your data is hosted in the cloud and other (perhaps sensitive) data is hosted on your premises.

Also, someone, at some point, will need to write Python or Java for something like Dataflow (Apache Beam), Spark, or TensorFlow. The coding gives a lot of flexibility, but in some ways it feels cumbersome compared to modern ETL tools like SAP Data Services, Informatica, and Microsoft’s SSIS. With those tools you drag and drop components to create your data pipeline rather than developing and maintaining reams of code. Out-of-the-box functionality reduces the risk of human error and the burden of supporting esoteric code.
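
To illustrate what that coding looks like, here is a minimal Apache Beam pipeline sketch in Python. The file paths and the column being filtered are hypothetical, and each transform is roughly the code equivalent of one drag-and-drop component in a traditional ETL tool:

    # A minimal Apache Beam pipeline; the file paths and column layout are
    # hypothetical placeholders. Runs locally with the DirectRunner by default;
    # add --runner=DataflowRunner (plus project and region options) for Dataflow.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-example-bucket/orders.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "FilterUK" >> beam.Filter(lambda fields: fields[2] == "UK")
            | "Format" >> beam.Map(lambda fields: ",".join(fields))
            | "Write" >> beam.io.WriteToText("gs://my-example-bucket/uk_orders")
        )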

You can move all your data warehousing capabilities to Google Cloud Platform, and Cloud SQL provides OLTP capabilities through a managed RDBMS. But keep in mind that many of the challenges of data do not go away just because a set of new technologies capable of handling vast amounts of data is in play. I’m referring to solution and data architecture, data quality, data profiling, data cleansing, data transformation, data modelling, transaction updates, and tracking history. BI System Builders have an end-to-end service framework for data management solutions known as Cornerstone Solution®, which addresses all of the above challenges. It also fits Google Cloud Platform very well and is currently being redesigned to accommodate data lakes and big data concepts.

Get in touch here if you’d like to engage us to help. Have a successful journey!

University of Warwick Science Park Press Release – July 2017

A successful partnership is being forged between a university and a business intelligence specialist that’s going after big data.

BI System Builders is discovering the huge benefits of being based at the University of Warwick Science Park’s Warwick Innovation Centre.
Founded by Russell Beech in 2009, the firm designs and builds business intelligence systems across a whole host of sectors. Since it first started, BI System Builders has worked with a number of high-profile clients including Volkswagen Group, Dixons Retail and Vision Express.

The company has developed the Cornerstone Solution, an expert application of BI System Builders’ services, techniques and know-how built up over the past 20 years. Created by Russell, the Cornerstone Solution is a programme, customised to individual clients, that addresses data governance, data quality and data integration.

“I see BI System Builders as the small company going after big data,” said Russell. “That means that through our Cornerstone Solution we consider everything from business analysis and requirements to source system data.
“We can work with clients, big or small, to make sure that their business intelligence project runs smoothly.
“Our process is all about providing valuable and useful information and data for clients to use and analyse.”

The company moved into the Warwick Innovation Centre last November, and since then Russell has developed valuable links with the University of Warwick by building relationships with departments and giving a lecture in The Oculus building at the university.

Russell added: “Being based at the university’s science park has provided invaluable links and it is something that I will strive to continue. Karen Aston, centre manager at the Warwick Innovation Centre, was instrumental in helping to put me in touch with the university. This is what the science park centres are all about, it’s not just about having a base, there is advice, help and networking opportunities available for everyone.”

Karen Aston, centre manager at the Warwick Innovation Centre, said: “BI System Builders has been a great addition to our Warwick site. I am so pleased with how Russell and the company have fitted in with the innovation centre and we were only too happy to help build links between the company and the university. Our centres are not just here to provide a home for companies, we can offer a wide range of support services to benefit all.”

Big Data Technology

Here’s a list of links to big data technology websites. Several of the providers below offer licence-free or open source batch, analytical, and data integration software. Our Cornerstone Solution® methods are software agnostic, so you can combine open source software with the Cornerstone Solution® framework methods to get powerful and cost-effective data solutions.

Hadoop
MapR
Pig
Sqoop
Flume
Spark
HBase
Hive
Avro
Parquet
Crunch
Zookeeper
Oozie
Cassandra
Impala
Talend
Pentaho
Connexica
Couchbase
Karmasphere
Hadapt
Neo Technology
Splunk
Hortonworks
Cloudera
Datameer
Platfora
Sisense
DataStax
Tableau
Tibco
LucidWorks
Acunu
MongoDB
Precog
YarcData
Kapow Software
Zettaset
Space-time Insight
ClearStory Data
AtScale
Grafana
Vertica
Cazena
MemSQL
Phemi
Snowflake
Syncsort DMX-h
Jethro

Languages
Perl
Ruby
Scala
Python
    Theano
    PySpark
    NumPy
    SciPy
    scikit-learn

Useful Related Technologies
HPCC
Storm
Sensu
Django
PostgreSQL
Graphite
AWS
Redshift
Flink
Kafka
MXNet
NiFi
Apache Beam
HornetQ
RabbitMQ
D3.js
RShiny
Leaflet
Kibana
TensorFlow
Scrapy
R
RapidMiner
H2O.ai

For Developing
Go (Golang)
Node.js
Backbone.js
Jupyter

Cloud
GCP
Azure
AWS