Initial plan/outline for Deep Match, coming to this space.

Preliminary Material: demonstration nugget

A very small version of the system will be described and created to 
demonstrate how it works and to explain its scalability and other 
strengths.

Constituent parts of demonstration nugget:

1. A set of data to use as an example of what needs to be found.

For the simplest possible explanation we will use five data items,
each of which is a blob of text and nothing more. In this case
they are these blobs:

  (1) In the middle of a desert, many miles from civilisation, a lone cactus had just awoken from
      a terrifying nightmare and was surveying the moonlit landscape with mixed feelings of
      dread and fear. 

  (2) "You are not surprised when a fig tree brings forth figs, so why are you surprised when 
       the world produces its normal crop of happenings? A physician or shipmaster would be ashamed to 
       be surprised if a patient proved feverish or a wind contrary. No event can happen to an ox, 
       a vine, a stone or indeed a football team but what properly belongs to the nature of oxen, vines, 
       stones or football teams. If all that happens is natural then why should there ever be reason to complain?"

  (3)  Ra Ra El vanished into his seventy sixth vodka and Bob looked around him, wondering what 
       to do. A strange individual with dreadlocks came up to Bob and asked him if he knew 
       where to find the reference section. Bob said that he didn't. The dreadlocked one gave 
       him a book - it was some sort of biology reference manual and was called 'A complete 
       guide to the species Homo Sapiens'.

  (4)  A cow with red filaboots approached stealthily. As it drew nearer to Bob (and the 
       very small child) it began slowly to sway its head in a peculiar manner.

  (5)  The moog sighed deeply and then explained to Bob that the whole concept of life was
       pointless, and that life itself was made up of a series of events which were all pointless.
       However, as there were varying degrees of pointlessness, certain of these events seemed, in
       relation to others, to have a point, even though everything was in reality totally pointless
       and farcical.

(the files of data will be named raw1.txt, raw2.txt, raw3.txt, raw4.txt, raw5.txt)

2. First script: remove non-word characters (eg commas, apostrophes, etc) and
   unnecessary spacing, producing raw1processed1.txt
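
A minimal sketch of such a cleaning script, written here in Python purely as an
assumption (the plan does not name a language for the scripts), and assuming
raw1.txt holds the blob exactly as shown above:

    import re

    def clean(text):
        # replace commas, apostrophes, quotes and other non-word
        # characters with spaces
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
        # collapse runs of whitespace (including newlines) into single spaces
        return re.sub(r"\s+", " ", text).strip()

    with open("raw1.txt") as f:
        cleaned = clean(f.read())

    with open("raw1processed1.txt", "w") as f:
        f.write(cleaned + "\n")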

3. Second script: will produce raw1processed2.txt from raw1processed1.txt.

   It will take 10 words in a row from the item, starting at the beginning:

eg with...

	In the middle of a desert, many miles from civilisation, a lone cactus had just awoken from
        a terrifying nightmare and was surveying the moonlit landscape with mixed feelings of
        dread and fear.

	it will first take 
	In the middle of a desert many miles from civilisation

	and then
	the middle of a desert many miles from civilisation a

	and then 
	middle of a desert many miles from civilisation a lone

	and then
	of a desert many miles from civilisation a lone cactus


	and with each of those (and all the others to come after it, until it runs
	out of words - so the final few windows will contain fewer than 10 words,
	each shrinking step by step) it would do something like this...

	with
	In the middle of a desert many miles from civilisation

	this...
	In the middle of a desert many miles from 
	In the middle of a desert many miles 
	In the middle of a desert many 
	In the middle of a desert
	In the middle of a 
	In the middle of 
	In the middle 
	In the 

	and...
	the middle of a desert many miles from civilisation
	middle of a desert many miles from civilisation
	of a desert many miles from civilisation
	a desert many miles from civilisation
	desert many miles from civilisation
	many miles from civilisation
	miles from civilisation
	from civilisation

	and then it would go through each of those and put each batch in all possible arrangements, eg...

	 many miles from civilisation

	would have 4 x 3 x 2 (ie 24) possible permutations,

	a desert many miles from civilisation

	would have 6 x 5 x 4 x 3 x 2 (ie 720) possible permutations

	and all of that would be added, in bulk, to the processed file

				etc...

and so you create a VERY LARGE file of data, including many repeats/duplicates
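
A minimal sketch of the second script, again in Python as an assumption. It mirrors
the truncation lists above (dropping words from the end, then from the front);
whether the full, untruncated window should also be permuted is left open in the
plan above, and permuting every batch is exactly what makes the output grow so fast:

    import itertools

    with open("raw1processed1.txt") as f:
        words = f.read().split()

    with open("raw1processed2.txt", "w") as out:
        for start in range(len(words)):
            # sliding 10-word window (shorter near the end of the item)
            window = words[start:start + 10]
            # batches made by dropping words from the end (9, 8, ... 2 words)...
            batches = [window[:n] for n in range(len(window) - 1, 1, -1)]
            # ...and batches made by dropping words from the front (9, 8, ... 2 words)
            batches += [window[m:] for m in range(1, len(window) - 1)]
            for batch in batches:
                # every possible arrangement of the batch, added in bulk
                for perm in itertools.permutations(batch):
                    out.write(" ".join(perm) + "\n")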

4. With a Linux shell command (eg sort raw1processed2.txt | uniq > raw1processed3.txt),
   order the list alphabetically and remove the duplicates in a two-part swoop,
   producing raw1processed3.txt

5. Third script: goes through every line and for each one opens or creates a 'match-card'*, eg
   cards/a/ades/adesertm/adesertmanymilesfromcivili
   could be the name of a card inside a very large and very deep directory system,
   and in it the number "1" would be added as a unique record.
   If the same phrase pops up in raw2395processed3.txt, the number 2395 will be
   written to that card too. Thus a search for that phrase will pull out those
   records, ie 1, 2395, etc. If the same phrase appears more than once in a
   document you can ignore the repeats, or you can add a tally mark to the
   record in some way, enabling you to perform a relevance re-ordering. There
   is no really significant difference between these two methods in making the
   top results 'more relevant', but the re-ordering method will 'appear' more
   mainstream and relevant to users and to surface-level analysis. There are
   arguments for both choices (to come).



[then move on to the next record and repeat - ie raw2.txt, etc - as in the sketch below]
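
A minimal sketch of the third script, in Python as before. The card-naming scheme
here (lowercased, spaces removed, truncated to 26 characters, filed under 1-, 4-
and 8-character prefix directories) is inferred from the single example path above,
and the card_path helper is a hypothetical name:

    import os

    def card_path(phrase, root="cards"):
        # hypothetical scheme inferred from cards/a/ades/adesertm/adesertmanymilesfromcivili
        key = phrase.replace(" ", "").lower()
        return os.path.join(root, key[:1], key[:4], key[:8], key[:26])

    def index_file(doc_number, processed3_path):
        record = str(doc_number)
        with open(processed3_path) as f:
            for line in f:
                phrase = line.strip()
                if not phrase:
                    continue
                path = card_path(phrase)
                os.makedirs(os.path.dirname(path), exist_ok=True)
                try:
                    with open(path) as card:
                        existing = card.read().split()
                except FileNotFoundError:
                    existing = []
                # write each document number once per card (the 'ignore repeats'
                # option; a tally mark could be appended here instead)
                if record not in existing:
                    with open(path, "a") as card:
                        card.write(record + "\n")

    # move on to the next record and repeat - raw1 to raw5 in this demonstration
    for n in range(1, 6):
        index_file(n, "raw%dprocessed3.txt" % n)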


6. The retrieval (search) script is very simple: take the input, open the card whose name fits,
   process the card's data any way you like and print it on the screen (or, if you are lazy and
   wasteful, you could include the layout when you write to the card, so that you literally
   print the card to the screen as a page).
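
A minimal sketch of the retrieval script, reusing the same hypothetical card_path
scheme assumed in the indexing sketch:

    import os

    def card_path(phrase, root="cards"):
        # same hypothetical naming scheme as in the indexing sketch
        key = phrase.replace(" ", "").lower()
        return os.path.join(root, key[:1], key[:4], key[:8], key[:26])

    query = input("search phrase: ")
    try:
        with open(card_path(query)) as card:
            # the card holds one document number per line
            print("matching records:", ", ".join(card.read().split()))
    except FileNotFoundError:
        print("no matching records")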

*match-card is a term I have invented to denote the nature of these objects. In computational
terms they are text files, or records in a database if you prefer, but there are better
arguments for text files and your own bespoke database languages; mainstream commercial
activity is no challenger to this fundamental reality.

[this demonstration and document will continue to be worked on over winter 2013-2014; there is
much more to come, both here and on Open Hobo]