
Monday, August 31, 2015

The Seven C's of Analysis.

Feeling adrift in the topic of analytics and data science?

[Image: Earth and its seven seas]

There are two types of people in the world: those who categorize and those who do not. There are many categorizations around analytics: model-based, predictive, and so on.

Five cases or levels of analysis - Descriptive, Diagnostic, Discovery, Predictive and Prescriptive - are often listed. Analyses are done on Big Data, Small Data, Rich Data and so on. Data gets classified by Velocity, Volume, Variety, Veracity and Value (the 5 V's).

What is the definition of analysis?
"Examination of structure as a basis for interpretation" is a fair working definition. The process or action definition amounts to taking inputs and ideas and turning them into decisions and deliverables.

So how does analysis work, in the real world and in theory? Past, present and future? Note that even asking the question involves a bit of analysis - deciding that time, space, quality and value are important parameters in our discussion.

Back to the seven C's (three pronounced "ch" and four pronounced "k"). Three are old:

Cheat: Know the answer for certain a priori. Like The Sting. Or like doing the Kaggle Titanic competition from a download of the historical list of survivors. Or edX or Coursera students who use a shadow account to get the answers.

Chance: Guess the answer randomly... Not based on any "knowledge".
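A toy sketch of such a pure-chance baseline (the labels below are invented, not the real Titanic data):

```python
# Pure chance: predict by coin flip with no "knowledge", and see how often it matches.
import random

random.seed(42)
labels = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]          # hypothetical survived / did-not-survive outcomes
guesses = [random.randint(0, 1) for _ in labels]  # random guesses, no information used

accuracy = sum(g == y for g, y in zip(guesses, labels)) / len(labels)
print(f"chance accuracy: {accuracy:.2f}")         # hovers around 0.5 on average
```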

Chestnuts: Use rules of thumb, experts and one's gut. Experts can be consultants, practitioners, or the highest paid person's opinion (HiPPO).

The next four are what most would consider actually grounded in data science.

Coordinations: Statistics and regressions against well-known explanatory variables like space, time and value - f(t, x, y, z). Note the nuance versus correlation below.
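A minimal sketch of that kind of fit, using nothing but numpy and synthetic data (the coefficients and setup are made up):

```python
# "Coordinations": regress a measured quantity against known coordinates (t, x, y, z).
import numpy as np

rng = np.random.default_rng(0)
n = 200
t, x, y, z = (rng.uniform(0, 10, n) for _ in range(4))
value = 2.0 * t - 0.5 * x + 1.3 * y + 0.1 * z + rng.normal(0, 0.2, n)

# Ordinary least squares: value ~ b0 + b1*t + b2*x + b3*y + b4*z
A = np.column_stack([np.ones(n), t, x, y, z])
coeffs, *_ = np.linalg.lstsq(A, value, rcond=None)
print(coeffs)  # recovers roughly [0, 2.0, -0.5, 1.3, 0.1]
```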

Correlations: f(v) for any vector v - emergence - not just connections to time and space. Anything can be connected to anything else... in any way.
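By contrast with the above, a correlation pass does not privilege time or space; it just looks at pairwise relationships among whatever columns are there. A sketch with invented features:

```python
# "Correlations": pairwise relationships among arbitrary features, not just coordinates.
import numpy as np

rng = np.random.default_rng(1)
n = 500
temperature = rng.normal(20, 5, n)
ice_cream_sales = 3.0 * temperature + rng.normal(0, 10, n)   # correlated by construction
shoe_size = rng.normal(42, 3, n)                             # unrelated

features = np.column_stack([temperature, ice_cream_sales, shoe_size])
corr = np.corrcoef(features, rowvar=False)
print(np.round(corr, 2))  # strong temperature/sales correlation, ~0 elsewhere
```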

Crowd: Gather wisdom from many actors and models. Maybe using rules of thumb, gut feel or any of the others, but getting it right(er) by the law of averages. Boosting might even be clumped into this crowd - so to speak.
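A toy sketch of that law-of-averages effect, with made-up noisy estimators: any single one is mediocre, but their average lands close to the truth.

```python
# "Crowd": average many noisy estimates; the error of the mean shrinks as the crowd grows.
import numpy as np

rng = np.random.default_rng(2)
true_value = 100.0
crowd = true_value + rng.normal(0, 15, size=50)   # 50 noisy "experts" or models

print(f"typical single error: {np.mean(np.abs(crowd - true_value)):.1f}")
print(f"crowd-average error:  {abs(crowd.mean() - true_value):.1f}")
```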

Continuous: Iterate and apply the above in the real world. Not quite like the others, this is about how the analytics get generated and acted upon. Most of the above answer the call when the need arises; they do not have a built-in instinct nor a process per se.

Wednesday, August 19, 2015

Nova Scotia Omiyage Gift Items.

A trip to Nova Scotia generally means bringing back gift items characteristic of the area. We usually go with Coffee Crisp or Aero bars. Or Smarties. Sometimes even Cherry Blossom, Crispy Crunch, or Crunchie.
But if you do not have a sweet tooth there are alternatives:

Green Tomato Chow Chow.

One that has recently come up - good for almost anyone - is dulse.

[Image: dulse (Palmaria palmata) seaweed snack]

Ketchup flavoured potato chips are often a favourite.
There are also local specialties like gouda from That Dutchman.
Items from Jost Vineyards.

All in the spirit of omiyage.

Friday, August 7, 2015

Anna Talk. Charlie Thoughts.

The development of the kids is always interesting. Anna, two, can repeat almost everything we say (even long complicated sentences with long complicated words). She has memorized the text of books, and can "read" them. She knows the numbers up to four or five, and can point them out (as the cardinality of a group) or count them out.

[Image: hand counting to three]
  
Charlie, four, is interested in superheroes, remote controls, and trains. He has started to ask about hard concepts like death, jealousy and social exclusion. He recalls events and dreams, and recounts them understandably.

Thursday, August 6, 2015

Data Science. Feature Engineering. Sustainable Value. Seeking.

Had some exposure to data science long ago in undergraduate and graduate school:
Mapping spatial spin dynamics in helium fluids with NMR, studying positron annihilation in CuTi alloys, and coming up with good deposition recipes for making smooth gold substrates (as flat backgrounds against which to do scanning tunnelling microscopy of biologics and large chain molecules) are three that come immediately to mind.

Got some sense of the theoretical underpinnings of data science from mentors in the math department at Dalhousie University. I see that Dalhousie now has a data science department and will be hosting the international conference KDD-2017. The Physics Department engendered a fondness for a hands-on approach. Got exposure to a wide variety of data and modeling techniques in the Condensed Matter Physics department at Cornell. Was away from data science for a long time, on pathways of sensors and instrumentation as things in and of themselves, and business issues around production, logistics and customer support (of what amounts to instrumentation software). But analytics of large streams of data (sensors and tags in the Internet of Things) has brought me back to data science.

Have been taking the MIT edX course on data science, and looked at many of the things it pointed to. Have become a fan of the materials at KDnuggets. And have been thinking hard about feature engineering. Note that the feature engineering article in Wikipedia is very lean, which belies the topic's importance.

The Machine Learning Mastery article on discovering feature engineering by Jason Brownlee referenced there is well worth reading (well written, and the math does not overwhelm).
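As a small, hypothetical illustration of what feature engineering looks like in practice (the column names below are invented, not from any particular dataset): derive new columns that expose structure the raw ones hide.

```python
# Feature engineering sketch: derive features that expose structure the raw columns hide.
# Column names and values are hypothetical, for illustration only.
from datetime import datetime

record = {"timestamp": "2015-08-06 14:30:00", "flow_in": 12.4, "flow_out": 11.1}

ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
features = {
    "hour_of_day": ts.hour,                               # cyclic behaviour often hides in timestamps
    "is_weekend": ts.weekday() >= 5,                      # categorical flag derived from a date
    "flow_loss": record["flow_in"] - record["flow_out"],  # difference of two raw sensors
    "flow_ratio": record["flow_out"] / record["flow_in"], # ratio of two raw sensors
}
print(features)
```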

There are many articles about how one can excel at data science (and win Kaggle competitions *grin*) by using some core principles: collecting and understanding the data, feature engineering, applying standard or non-standard data science techniques, and boosting (or model combination).

There are very interesting new companies and business models which leverage the application of data science to seek "treasure". Some have new and wonderful techniques (like Ayasdi), and many have great tools to make a data scientist's life easier, like the database work from Deep Information Sciences, and the automatic model selection and combination in Azure ML and IBM's offerings.

Cannot help but think that this all only gets one so far. The value of a model has to be harvested by deploying that model into the real world with real-world constituents, and few have ventured there (or have simply taken it for granted). Further, a good model often begs for more data.

Those are things we have thought hard about in Analytika. One can construct a cycle:
  • A. get and understand data
  • B. feature engineer
  • C. model and transcend
  • D. deploy (get ongoing data, get ongoing analytics results)
  • E. harvest business value
  • F. goto A.
Thinking about how anywhere along the line we might realize a new feature in existing data, or ask for....
"Can we get a new temperature sensor?" "Do you have data on turbidity?" "Do you have a flow sensor on x?" The state of the art is a long way from a computer or AI asking, "Do you have any data on how bright the clothing was for the Titanic survivors? How tall was each passenger?" Human inquiry and framing seems sustainably key.