MySQL Travails

This post is about some of the technical challenges we recently ran into while working on a new project. So, if technical details are not your cup of tea, this is a good blog to avoid.

However, if you work with MySQL, you may want to read on.

At a very high level, all we wanted to do was load some data into MySQL tables. It initially seemed pretty straightforward, since there were fewer than 4 million records with just about 20 fields. However, when we attempted to load the data, we got an error.

This may be obvious to all you seasoned MySQL gurus out there, but for us this was a bit of a challenge.

Error Code: 1300. Invalid utf8 character string: '"El Batey de Do'

When we checked our CSV file, we found that the name is in a language other than English, and it contains an ñ exactly where the error message cuts off:

El Batey de Doña Provi Garden.

We thought we had set the character set and collation correctly, but it turned out that we needed to set them in a different way.

Luckily, one of our senior developers was able to fix it quickly.
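
For anyone hitting the same error, here is a minimal Python sketch of how one might find the offending rows before loading. It assumes (and this is only an assumption, not necessarily what happened in our case) that the CSV was saved in Latin-1 while MySQL expected UTF-8. The function names and the sample row are made up for illustration.

```python
def find_invalid_utf8_lines(raw: bytes):
    """Return (line_number, line_bytes) for every line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode bytes from an ASSUMED source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical two-line CSV, saved as Latin-1 (so the n-tilde is the single byte 0xF1).
sample = "id,name\n1,El Batey de Doña Provi Garden\n".encode("latin-1")

print(find_invalid_utf8_lines(sample))   # reports line 2 as invalid UTF-8
fixed = to_utf8(sample)
print(find_invalid_utf8_lines(fixed))    # no invalid lines after re-encoding
```

If the file really is in a different encoding, another option is to tell MySQL so at load time; for example, LOAD DATA INFILE accepts a CHARACTER SET clause for this purpose.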

Machine Learning for business entity intelligence

Yesterday I had a discussion with a client (we will call him Seth) who leads the innovation team in a large company. Seth has a broad mandate to determine how to extract intelligence from unstructured data. We discussed how data available on the web and social media can be used as sources to gain insights about business entities.

Conceptually, this is all doable. We believe it can be done with a multi-step approach:

  • Identify companies that have relevant business entity data (e.g. Amazon, Etsy, Facebook, LinkedIn, Foursquare, Yelp, Alibaba, etc.)
  • Use API access, screen scraping or other means to extract business entities and product data from them
  • Cleanse, parse and standardize data that has been extracted
  • Match against existing databases to identify new business entities or existing ones
  • Extract intelligence from the consolidated information obtained so far
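
To make the standardize and match steps a bit more concrete, here is a rough Python sketch. It is not the workflow we proposed to the client, just an illustration under simple assumptions: matching is done on exact equality of a normalized name, and the suffix list, function names and sample company names are all hypothetical.

```python
import re
import unicodedata

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def standardize_name(name: str) -> str:
    """Strip accents, lowercase, drop punctuation and legal suffixes."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def split_new_and_existing(extracted, known):
    """Split extracted names into (new, existing) by standardized name."""
    known_keys = {standardize_name(n) for n in known}
    new, existing = [], []
    for name in extracted:
        target = existing if standardize_name(name) in known_keys else new
        target.append(name)
    return new, existing


# Made-up sample data for illustration only.
known = ["Acme Widgets, Inc.", "El Batey de Doña Provi Garden"]
extracted = ["ACME Widgets Inc", "El Batey de Dona Provi Garden", "Blue Door Bakery LLC"]

new, existing = split_new_and_existing(extracted, known)
print(new)       # names not found in the known list
print(existing)  # names that match a known entity
```

Exact matching on a normalized key is the simplest possible choice. Real data would likely need fuzzy matching and more fields (address, phone, etc.), which is where the harder work lies.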

However, as expected, there are several challenges in each step.

For example, some social media websites (e.g. Twitter, Foursquare) have APIs to enable programmatic access, while many others do not or restrict access (e.g. Facebook, LinkedIn), so getting access to the data itself may be a challenge.

Secondly, given the wide variety of data available, coming up with rules for cleansing, parsing and standardization may be a daunting challenge. However, here machine learning techniques can possibly be used to automatically generate and maintain such rules.

Machine learning can also be used in the final step, i.e. extracting intelligence from the processed information once it has been matched against the existing database.

Seth was excited by our approach and said that he would like to go ahead with a proof of concept (POC), pending approval from senior leaders.

While we work through the proposal development phase with Seth, I plan to provide a couple of concrete examples of the workflow that we proposed.

As usual, feel free to provide feedback, ask questions and leave comments on this post.


What is a datapreneur?

You may have heard of a solopreneur. However, you may be wondering what exactly a datapreneur is.

Some of you may be thinking – that’s easy – it’s an entrepreneur who works with data. Yes – but in this case it’s a little more complex than that.

To me D.A.T.A. is an acronym – it stands for Data, Analytics, Technology, Algorithms.

I am an entrepreneur running a tiny company on the East Coast. We help companies improve their bottom line by using data, analytics, technology or algorithms.

I wanted to have a place where I could share my daily experiences without worrying too much about the quality of the writing.

So this is my place to share my “stream of consciousness”. I welcome questions, comments and feedback – so feel free to jump into the discussion.