Monday 12 August 2013

The Personal Corpus

Text mining tools and algorithms are becoming increasingly sophisticated. Most people are unaware of what can be revealed from the data that they post to Facebook, or the data they submit to Google through searching and using Gmail. We are now in a position where the global internet corporations know more about each of us than those closest to us; indeed, with their analytic tools, they may know more about us than we do ourselves.

But text mining tools, whilst complex in their algorithms, are not rocket science. It wouldn't take much to provide these kinds of tools to ordinary learners and teachers. I'm finding the idea of empowering everyday users with sophisticated data mining tools increasingly attractive as a means of gaining greater personal autonomy in the face of global forces which are harvesting personal data for their own ends. So let's start with the idea of a "personal corpus".

A Personal Corpus is the sum total of the text you ever write. Emails, essays, tweets, etc. Everything goes in. Your data analytic tools can pick over it. You can do a particular kind of search with this sort of setup. Rather than say what you are looking for, you say what you think is the most important thing at a particular time: the "topic" of the moment. A "topic" is really a compression of a lot of stories in a flow of information. The analytic tools examine your "topic". They might look for occurrences of your topic in your corpus. But more importantly, they might look on the internet for other 'stories' relating to your topic. What emerges is a search corpus (drawn from the internet). The match between the search corpus and the personal corpus can then be calculated. It may be that the "topic" is something new; something you've never thought about before. In this case, a process of recursive search can reveal sub-topics that might lead you from your chosen topic to the topics identified in your personal corpus. A path between your topic and the topics already in your corpus can be calculated.
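
The recursive search for a path could be sketched roughly as a breadth-first search. This is only an illustration under assumptions: `find_topic_path`, `toy_related` and the `TOY_RELATED` table are hypothetical names invented here, and the table is made-up toy data standing in for a real internet search that returns sub-topics.

```python
from collections import deque


def find_topic_path(start, known_topics, related, max_depth=3):
    """Breadth-first search from a new topic to any topic already in the corpus.

    `related` is any function that maps a topic to a list of sub-topics.
    `max_depth` is the maximum number of hops allowed in the path.
    Returns the path as a list of topics, or None if no path is found.
    """
    known = {t.lower() for t in known_topics}
    start = start.lower()
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in known:
            return path
        if len(path) > max_depth:
            continue  # hop limit reached, do not expand further
        for sub in related(current):
            sub = sub.lower()
            if sub not in seen:
                seen.add(sub)
                queue.append(path + [sub])
    return None


# Hypothetical toy data standing in for sub-topics found on the internet.
TOY_RELATED = {
    "data mining": ["text mining", "machine learning"],
    "text mining": ["corpus linguistics", "topic models"],
    "machine learning": ["statistics"],
    "corpus linguistics": ["e-portfolio"],
}


def toy_related(topic):
    """Look up sub-topics in the toy table (an assumption, not a real search)."""
    return TOY_RELATED.get(topic, [])


path = find_topic_path("data mining", ["E-portfolio", "cybernetics"], toy_related)
print(path)
```

Breadth-first search is used here because it returns a shortest path first; a real tool would probably also want to weight the links by how strongly the sub-topics match.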

So, for example, you think the most important thing at the moment is "data mining". I have a personal corpus (this blog!) which I can search for this. But a keyword search is less revealing than a search where I compare all the expanded definitions and stories around 'data mining' with the narratives I already have in my personal corpus. Here I can look at the depth of matching, identify associated terms, and explore the links between those associated terms and my corpus. So I can identify, for example, that in 2010 I was talking about something like this, and maybe I would want to revisit some of this work. To me, that is valuable because I've been redirected to look not at some resource on the internet, but at something that was already within me.
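
One way the "depth of matching" might be calculated is with TF-IDF weights and cosine similarity. That choice is my assumption for the sake of a sketch, not a claim about what any particular tool does. The posts and the search text below are invented toy data, and `rank_posts`, `tfidf_vectors` and the tiny `STOPWORDS` set are hypothetical names.

```python
import math
import re
from collections import Counter

# A deliberately tiny stopword list, just enough for the toy data below.
STOPWORDS = {"a", "and", "the", "in", "of", "when"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_posts(posts, search_text):
    """Rank posts ({label: text}) by similarity to the search corpus text.

    Returns (label, score, shared_terms) tuples, best match first.
    """
    labels = list(posts)
    vectors = tfidf_vectors([posts[label] for label in labels] + [search_text])
    query = vectors[-1]
    results = []
    for label, vec in zip(labels, vectors):
        shared = sorted(
            set(vec) & set(query), key=lambda t: vec[t] * query[t], reverse=True
        )
        results.append((label, cosine(vec, query), shared[:5]))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Invented toy data: two "blog posts" and a snippet standing in for the search corpus.
posts = {
    "2010 post": "patterns in learner text reveal hidden topics when mining a corpus of essays",
    "2012 post": "music improvisation and the experience of playing the piano",
}
search_text = "data mining and text mining find patterns and topics in a large corpus"

ranked = rank_posts(posts, search_text)
for label, score, shared in ranked:
    print(label, round(score, 3), shared)
```

The shared terms returned for each post are a crude version of the "associated terms" idea: they show which words link an old post to the stories found around the topic.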

With a Personal Corpus, the relationship between the user and social software is reversed. Most social software tools are used for 'sharing' documents - social software serves as a repository. With the Personal Corpus, the internet and social software tools are used as a resource for data extraction; corpus data is not intended to be shared, but stored (maybe) locally in order to be analysed.

With the Personal Corpus, individual users can determine the likely impact of particular social messages, whilst at the same time gaining an insight into the value that companies like Google might extract from that data. But more importantly, it provides greater personal autonomy by allowing users to explore the likely impact of different kinds of intervention.

The Personal Corpus might be seen as an extension to E-portfolio tools (which never really took off, did they?!) or to the Personal Learning Environment (which was hijacked by the axe-grinding blogeratti!). It might provide a way of really giving learners some useful tools which give them something back that might have some meaning for them...

6 comments:

David said...

Sounds like an August project to me. Are you in the office this week?

Mark Johnson said...

Yes - I agree! How about tomorrow or Wednesday?

David Sherlock said...

I can come in either day, just let me know when you're around and I'll come up

David Sherlock said...

I built the basics of what we talked about yesterday:

http://www.youtube.com/watch?v=2UwbqMfL1jI&feature=youtu.be

You might need to full screen it. I think we need to talk to Adam/Ben/somebody with experience. But the idea is there!

Mark Johnson said...

Brilliant! This is exactly it! I agree we should talk to some people now...
There's a much bigger project in this.

David Sherlock said...

Jeff Horner, the guy behind Rook, rApache and various other web interfaces to R, did something about telling stories with data, R and rApache, but I never really followed it: https://github.com/jeffreyhorner