25.1 How was my data collected?

2022-07-08 Do you actually know how your data was collected? We can live quite happily in ignorance of how the data was actually collected, assuming it is all hunky dory, until we begin trying to explain some oddities that our analysis has thrown up.

I have experienced this many times, and indeed continue to do so. We make assumptions only to learn later that we did not have a full picture of the world. We probably never will, but we should always be striving to learn a little more.

Here’s an example. I wanted to analyse my household energy consumption. I have a backlog of a few years of PDF invoices from my electricity, gas, and water companies. The files are all nicely named following a scheme that includes the date, the company, the amount of consumption, and the dollars billed, something like 20220708_actew_3245_132.pdf. That’s handy, since I can programmatically build a dataset of consumption by parsing the filenames, as sketched below. The invoices come through quarterly.
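As a sketch of that parsing step, here is a minimal Python example. It assumes the YYYYMMDD_company_amount_dollars.pdf naming scheme above; the invoices folder and the field names are hypothetical, not from any particular utility.

```python
# A minimal sketch, assuming filenames of the form
# YYYYMMDD_company_amount_dollars.pdf, e.g. 20220708_actew_3245_132.pdf.
from pathlib import Path

import pandas as pd

def parse_invoice_name(path: Path) -> dict:
    """Split an invoice filename into its four fields."""
    date, company, amount, dollars = path.stem.split("_")
    return {
        "date": pd.to_datetime(date, format="%Y%m%d"),
        "company": company,
        "amount": int(amount),    # consumption in the utility's units
        "dollars": int(dollars),  # billed cost
    }

# Build the consumption dataset from a folder of invoices
# (the folder name here is hypothetical).
invoices = pd.DataFrame(
    [parse_invoice_name(p) for p in Path("invoices").glob("*.pdf")]
)
print(invoices.sort_values("date"))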

All well and good, until for some reason the electricity consumption was flat over winter and yet spiked in spring. Odd, since it was a very cold winter and the household heater seemed to be on more than off. A little investigation found the reason. My meter is manually read each quarter, and from the reading I receive the bill. To save costs, though, the company now estimates the reading every second quarter (or I can provide a reading to them if I like). If the estimate is based on the last quarter then, as winter hits, the estimate will of course be quite an underestimate, as the toy calculation below illustrates.
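To make the mechanism concrete, here is a toy calculation with invented numbers. It assumes the estimate simply repeats the previous actual read and that the next actual read absorbs any shortfall, which matches the pattern I observed.

```python
# Toy quarterly figures (invented) in kWh.
actual_use = {"autumn": 900, "winter": 1500, "spring": 800}

# Autumn is an actual meter read; winter is estimated from it.
billed_winter = actual_use["autumn"]  # billed 900, though 1500 was used

# Spring is an actual read again, so it absorbs the winter shortfall.
billed_spring = (actual_use["winter"] - billed_winter) + actual_use["spring"]

print(billed_winter, billed_spring)  # 900 1400: flat winter, spring spike
```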

I’ve seen this kind of situation in many data science projects. Different operators, for example, might interpret the codes they are entering differently, or the meaning of the codes may change over time. Some numeric data may be recorded as debit values for a period and then as credit values due to changes in business processes, and so on.
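One cheap defence is to profile such columns over time. Here is a minimal sketch, assuming a hypothetical pandas dataframe with date and amount columns, that surfaces a debit-to-credit style sign flip.

```python
import pandas as pd

def sign_flip_profile(df: pd.DataFrame) -> pd.Series:
    """Proportion of negative `amount` values per quarter.

    A sudden jump from near 0 to near 1 (or the reverse) hints that
    the recording convention changed, e.g. debits became credits.
    """
    negative = df["amount"] < 0
    return negative.groupby(df["date"].dt.to_period("Q")).mean()
```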

So, ask questions, but just as importantly, keep “living and breathing the data” to look for oddities. Don’t just ignore them, as I see junior data scientists do. Investigate and understand. Your project outcomes may depend on it, and if you make these discoveries late, perhaps six months in and after deployment, you may find you have caused harm.


