Sunday, 7 July 2019

The Preservation of Context in Machine Learning

I'm creating a short online course to introduce staff in my faculty to machine learning. It's partly about awareness-raising (what's machine learning going to do to medicine, dentistry, veterinary science, psychology, biology, etc?), and partly about introducing people to the tools which are increasingly accessible and available for experimentation.

As I've pointed out before, these tools are becoming increasingly standardised, with the predominance of Python-based frameworks for creating machine learning models. Of course, Python presents a bit of a barrier - it's so much better if you can do it in the browser, or indeed in a desktop app built on web technologies like Electron.js. So that is what I'm working on.

Putting machine learning tools in the hands of ordinary people is important. The big internet corporations want to present a message that only they have the "big data" and expertise sufficient to really handle the immense power of AI. I'm not convinced. First of all, personal data is much "bigger" than we think, and secondly, our machine learning tools are hungry for data partly because we don't fully understand how they work. The real breakthrough will come when we do understand how they work. I think this challenge is connected to appreciating the "bigness" of personal data. For example, you could think of your 20 favourite films and rank them in order. How much information is in that list?

Well (without identifying my favourites), we have
F
B
C
A
E
... etc

Now, if we consider that every item in the ranking stands in a relation to every other item, then the amount of data is actually the number of permutations of pairs of items. So,

F B
F C
F A
F E... and so on
That's 20!/(20-2)! = 20 × 19, or 380 rows of data from a ranked list of 20 items.
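
Here's a minimal Python sketch of that expansion (the item names are placeholders standing in for the real list):

```python
from itertools import permutations

# Placeholder items; rank position is just the list index (0 = favourite).
ranked = [f"item_{i}" for i in range(1, 21)]
rank_of = {item: pos for pos, item in enumerate(ranked)}

# Every ordered pair of distinct items becomes one row of data,
# labelled with whether the first item outranks the second.
rows = [(a, b, rank_of[a] < rank_of[b]) for a, b in permutations(ranked, 2)]

print(len(rows))  # 380 == 20 * 19 == 20!/(20-2)!
```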

So could you train an algorithm to learn our preferences? Why not?
Given a new item, can it guess which rank that item might occupy? Well, it seems it can have a pretty good stab at it.
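
To make that concrete, here's a hedged sketch of one way it could work - the feature vectors, the hidden "taste" direction, and the pairwise logistic model are all my assumptions for the example, not a description of a particular experiment. The idea is the standard pairwise trick from learning-to-rank: train on feature differences for each ordered pair, then estimate a new item's rank by counting how many known items the model thinks it beats.

```python
import numpy as np
from itertools import permutations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_items, n_features = 20, 5

# Hypothetical stand-ins: real features might encode genre, year, etc.
features = rng.normal(size=(n_items, n_features))
taste = rng.normal(size=n_features)          # a hidden "preference" direction
order = np.argsort(features @ taste)[::-1]   # item indices sorted best-first

# Pairwise training data: feature difference -> "is a ranked above b?"
X, y = [], []
for i, j in permutations(range(n_items), 2):
    a, b = order[i], order[j]
    X.append(features[a] - features[b])
    y.append(int(i < j))                     # i < j means a outranks b

model = LogisticRegression().fit(np.array(X), np.array(y))

# Guess the rank of an unseen item: count how many known items it beats.
new_item = rng.normal(size=n_features)
wins = sum(int(model.predict((new_item - features[b]).reshape(1, -1))[0])
           for b in order)
print(f"estimated rank: {n_items - wins + 1} of {n_items + 1}")
```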

This is interesting, because if machine learning can estimate how "important" we think a thing is, and we can then refine this judgement in some way (by adjusting its position), then something is happening between the human and the machine: the machine is preserving the context of the human judgement that was used to train it.

The predominant way machine learning is currently used is to give an "answer": to identify the category of thing an item is. Yet the algorithm that has done this has been trained by human experts whose judgements of categories are highly contextual. By giving a bare answer, the machine strips out the context. In the end, information is lost. Using ranking, it may be possible to calculate how much information is lost, and from there to gain a deeper understanding of what is actually happening in the machine.
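
As a toy illustration of that calculation (the probabilities are invented for the example), one crude measure of what gets discarded is the Shannon entropy of the classifier's full output distribution - which a bare top-category "answer" throws away entirely:

```python
import math

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A classifier's full output over four categories...
p = [0.55, 0.25, 0.15, 0.05]

# ...collapsed to just "the first category" keeps none of this
# uncertainty; the discarded context amounts to the entropy below.
print(f"information discarded: {entropy_bits(p):.2f} bits")
```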

Losing information is a problem in social systems. Where context is ignored, tyranny begins. I think this is why everyone needs to know about (and actively engage with) machine learning.
