
Cloud Computing & Big Data

Google Cloud Platform

If you are considering a journey onto Google Cloud Platform, BI System Builders would love to help, so please get in contact. We’re on a similar journey ourselves, and you can read about it below.

BI System Builders’ data, or at least a chunk of it, is migrating to Google Cloud Platform (GCP). Having had the opportunity to study towards the Google Professional Data Engineer certification and to work with Google Cloud Platform at a client site, I could not help but be impressed with the capabilities available on the platform: data warehousing, big data, data pipelines, and machine learning. I’ll explain…

Like many businesses, BI System Builders has a requirement to analyse its data for effective decision making. A big proportion of that data is collected via the company website and social media platforms. Because there is a lot of it, and it tends to be unstructured or semi-structured, the obvious candidate was Apache Hadoop. Why Hadoop?

Hadoop is designed to handle high volumes of unstructured and semi-structured data, so it was going to meet my needs. However, the idea behind Hadoop was not to build a supercomputer for the job but to split (shard) the data and distribute it across several computers in a cluster.

“In pioneer days they used oxen for heavy pulling and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” – Grace Hopper (quoted in Tom White, 2015, Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th Edition, O’Reilly, p. 3).

My first port of call on the Hadoop journey was the Apache Hadoop website, where I downloaded the Hadoop software. I quickly realised that the setup was not going to be straightforward and that I would need to acquire additional coding and tuning skills, something I had neither the time nor the appetite for. I was also going to need to buy, configure, and, importantly, maintain the new computers and code, which meant extra costs in hardware and labour. My second port of call was a free, pre-configured version of Hadoop in the form of a sandbox from Hortonworks, which I ran on a VM to gain more familiarity. This was partially successful, but I still ran into technical problems, and of course all I had was a sandbox to experiment with, not something I could use for the company’s live data.

My third port of call was the cloud, and I considered the big three: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. I was initially pulled towards Azure because it is closely tied to Microsoft SQL Server, an RDBMS I had worked with for the best part of twenty years. However, it was when I started using Google Cloud Platform that my progress and ideas started to accelerate.

So, does GCP include Hadoop? Oh yes, and what may not be widely understood is that Google has been in this game from early on. Hadoop stores its data in HDFS (the Hadoop Distributed File System), whose design traces back to a paper Google published in 2003 describing the Google File System. Then in 2004 Google published a paper introducing MapReduce. Is MapReduce important? Yes, because if you shard your data across numerous computers you need a way to process each shard where it sits and then bring the results back together for consumption, and that is what MapReduce’s map and reduce phases do.
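To make the idea concrete, here is a minimal, single-machine sketch of the map and reduce phases in plain Python (not Hadoop itself, and the shard contents are made up purely for illustration); in a real cluster the map step would run on the machines holding each shard and the framework would shuffle and reduce the partial results.

```python
# A minimal, single-machine sketch of the MapReduce idea in plain Python.
# In a real Hadoop cluster each shard would be mapped on the machine that
# holds it, and the framework would shuffle and reduce the partial results.
from collections import Counter
from functools import reduce

# Hypothetical shards of text data spread across a cluster.
shards = [
    "big data on google cloud",
    "hadoop shards big data",
    "mapreduce joins the results back together",
]

def map_phase(shard: str) -> Counter:
    """Count words within a single shard (runs independently per shard)."""
    return Counter(shard.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    """Merge two partial word counts into one."""
    return a + b

partial_counts = [map_phase(s) for s in shards]      # map: one result per shard
total_counts = reduce(reduce_phase, partial_counts)  # reduce: combine the partials
print(total_counts.most_common(3))
```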

If you log on to Google Cloud Platform you will not see a button to start Hadoop, but you will see a managed service called Dataproc. When you configure and start a Dataproc cluster you get a cluster that includes Hadoop, Spark, Hive, and Pig (members of the Hadoop ecosystem). For me this is a big win:

  1. I do not need to configure and manage the Hadoop ecosystem – Google does that for me – so I can focus on my objectives of analytics and machine learning
  2. I do not need to buy additional hardware; I effectively rent the hardware from Google
  3. I do not need to maintain (or pay someone to maintain for me) big data software or the hardware it’s installed on
  4. I only pay for what I use. If my Hadoop/Spark job takes three hours to run, I rent the VMs for three hours; I don’t pay for them when I’m not using them
  5. Without getting too technical, if I run Hadoop with HDFS the machines must stay on all the time for HDFS to persist my data. On GCP I can persist my data in Google Cloud Storage and power down the VMs when they have finished processing, so I don’t pay for them while they are idle
  6. Google seem to have done a great job of integrating the different components of the platform, so it’s not a problem to move data out of Dataproc into BigQuery (Google’s SQL-based query engine), TensorFlow (ML), or Google Data Studio for analytics; a minimal PySpark sketch of the kind of job involved follows this list
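To give a flavour of points 4 to 6, here is a minimal PySpark sketch of the kind of job you might submit to a Dataproc cluster (for example with gcloud dataproc jobs submit pyspark); the bucket paths and field names are hypothetical.

```python
# A minimal PySpark sketch of the kind of job you might submit to Dataproc.
# Bucket and path names are hypothetical; on Dataproc the gs:// paths are
# readable because the Cloud Storage connector is pre-installed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("website-event-counts").getOrCreate()

# Read semi-structured website event data from Cloud Storage.
events = spark.read.json("gs://example-bucket/raw/website_events/*.json")

# Count events per page per day (assumes event_timestamp and page fields exist).
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "page")
    .count()
)

# Persist the results back to Cloud Storage so the cluster can be shut down.
daily_counts.write.mode("overwrite").parquet(
    "gs://example-bucket/curated/daily_page_counts/"
)

spark.stop()
```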

As well as the benefits above, there are platform components that cover IoT (machine sensor data), Pub/Sub for message streaming (including between microservices), Dataflow for streaming and batch processing and loading of data, Bigtable (NoSQL) for real-time data, and Cloud Spanner, a horizontally scalable relational database for very large deployments, plus of course the tools for undertaking data science. A very good read on data pipelines and machine learning is Valliappa Lakshmanan, 2018, Data Science on the Google Cloud Platform – Implementing Real-Time Data Pipelines: From Ingest to Machine Learning, O’Reilly.
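As a small illustration of the Pub/Sub component mentioned above, here is a hedged sketch using the google-cloud-pubsub Python client; the project and topic names are hypothetical, and the snippet assumes the package is installed and GCP credentials are already configured.

```python
# A minimal sketch of publishing a message to Pub/Sub with the Python client.
# Project and topic names are hypothetical; requires the google-cloud-pubsub
# package and application default credentials to be configured.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "website-events")

# Message payloads are bytes; extra keyword arguments become string attributes.
future = publisher.publish(
    topic_path,
    data=b'{"page": "/home", "event": "page_view"}',
    source="website",
)
print("Published message ID:", future.result())
```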

So, is there a downside to GCP? You may feel that security is an issue. GCP has a strong security component known as IAM (Identity and Access Management), which to me seems at least as good as that found in SAP BusinessObjects, a system that has stood the test of time. But you are trusting your data to a Google-owned and managed data centre. In some ways that’s no different from the many large companies that host their data in data centres they do not own, but if you feel uncomfortable you can take a hybrid cloud/on-premise approach, in which some of your data is hosted in the cloud and other (perhaps sensitive) data is hosted on your own premises.

Also, someone, at some point, will need to write Python or Java for something like Dataflow (Apache Beam), Spark, or TensorFlow (ML). The coding gives a lot of flexibility, but in some ways it feels like a step backwards from modern ETL tools such as SAP Data Services, Informatica, and Microsoft’s SSIS, where you drag and drop components to create your data pipeline rather than developing and maintaining reams of code. Out-of-the-box functionality reduces the risk of human error and of having to support esoteric code. To show what that coding looks like, a minimal Apache Beam sketch follows below.
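This is a minimal sketch of an Apache Beam batch pipeline in Python; the bucket paths, field positions, and event names are hypothetical, and the same pipeline could run locally with the DirectRunner or on Dataflow by changing the runner options.

```python
# A minimal sketch of an Apache Beam (Dataflow) batch pipeline in Python.
# Bucket paths and field positions are hypothetical; run locally with the
# DirectRunner, or on GCP by supplying DataflowRunner options.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc. for GCP

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/web_logs.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "FilterPageViews" >> beam.Filter(lambda fields: fields[1] == "page_view")
        | "KeyByPage" >> beam.Map(lambda fields: (fields[2], 1))
        | "CountPerPage" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://example-bucket/output/page_counts")
    )
```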

You can move all your data warehousing capabilities to GCP, but keep in mind that many of the challenges of data do not go away just because a set of new technologies capable of handling vast amounts of data is in play. I’m referring to solution and data architecture, data quality, data profiling, data cleansing, data transformations, data modelling, transaction updates, and tracking history. BI System Builders has an end-to-end service framework for data management solutions known as Cornerstone Solution®, which addresses all of the above challenges. It also fits GCP very well and is currently being redesigned to accommodate data lakes and big data concepts. Have a successful journey!


University of Warwick Science Park Press Release – July 2017

A successful partnership is being forged between a university and a business intelligence specialist that’s going after big data.


BI System Builders is discovering the huge benefits of being based at the University of Warwick Science Park’s Warwick Innovation Centre.
Founded by Russell Beech in 2009, the firm designs and builds business intelligence systems across a whole host of sectors. Since it first started, BI System Builders has worked with a number of high-profile clients including Volkswagen Group, Dixons Retail and Vision Express.

The company has developed the Cornerstone Solution, an expert application of BI System Builders’ services, techniques and know-how built up over the past 20 years. Created by Russell, the Cornerstone Solution is a programme, customised to individual clients, that addresses data governance, data quality and data integration.

“I see BI System Builders as the small company going after big data,” said Russell. “That means that through our Cornerstone Solution we consider everything from business analysis and requirements to source system data.
“We can work with clients, big or small, to make sure that their business intelligence project runs smoothly.
“Our process is all about providing valuable and useful information and data for clients to use and analyse.”

The company moved into the Warwick Innovation Centre last November and since that time Russell has developed valuable links with the University of Warwick by building relationships with departments and leading a lecture in The Oculus building at the university.

Russell added: “Being based at the university’s science park has provided invaluable links and it is something that I will strive to continue. Karen Aston, centre manager at the Warwick Innovation Centre, was instrumental in helping to put me in touch with the university. This is what the science park centres are all about, it’s not just about having a base, there is advice, help and networking opportunities available for everyone.”

Karen Aston, centre manager at the Warwick Innovation Centre, said: “BI System Builders has been a great addition to our Warwick site. I am so pleased with how Russell and the company have fitted in with the innovation centre, and we were only too happy to help build links between the company and the university. Our centres are not just here to provide a home for companies; we can offer a wide range of support services to benefit all.”


Big Data Technology

Here’s a list of links to big data technology websites. Several of the providers below offer licence-free or open source batch, analytical, and data integration software. Our Cornerstone Solution® methods are software agnostic, so you can combine open source software with the Cornerstone Solution® framework methods to build powerful and cost-effective data solutions.

Hadoop
MapR
Pig
Sqoop
Flume
Spark
HBase
Hive
Avro
Parquet
Crunch
Zookeeper
Oozie
Cassandra
Impala
Talend
Pentaho
Connexica
Couchbase
Karmasphere
Hadapt
Neo Technology
Splunk
Hortonworks
Cloudera
Datameer
Platfora
Sisense
DataStax
Tableau
Tibco
LucidWorks
Acunu
MongoDB
Precog
YarcData
Kapow Software
Zettaset
Space-time Insight
ClearStory Data
AtScale
Grafana
Vertica
Cazena
MemSQL
Phemi
Snowflake
Syncsort DMX-h
Jethro

Languages
Perl
Ruby
Scala
Python
    Theano
    PySpark
    NumPy
    SciPy
    scikit-learn

Useful Related Technologies
HPCC
Storm
Sensu
Django
PostgreSQL
Graphite
AWS
Redshift
Flink
Kafka
MXNet
NiFi
Apache Beam
HornetQ
RabbitMQ
D3.js
RShiny
Leaflet
Kibana
TensorFlow
Scrapy
R
Rapidminer
H2O.ai

For Development
GoLang
Node.js
Backbone.js
Jupyter

Cloud
GCP
Azure
AWS