Light On Searching
#1
Posted 29 January 2007 - 03:02 PM
Sometimes I can't find a file unless I put in its whole name - though putting in an exact but incomplete version of it only turns up a few dozen unrelated results. For other items, putting in the whole exact title reveals nothing but putting in only half of it comes up trumps. How come?
#2
Posted 29 January 2007 - 03:11 PM
Have you fixed the fake server problem already? (Since you are not a complete newbie, I assume yes.)
And you will have to specify what kind of search you are talking about.
-Kad -> the first word determines where it will look.
-Server -> only the local server is searched.
-Global -> all servers are searched until .....
Trouble connecting to a server? Use kad and /or refresh your server list
Strange search results? Check for fake servers! Or download morph, enable obfuscated server required, and far less fake server seen.
Looking for morphXT translators. If you want to translate the morph strings please come here (you only need to be able to write, no coding required. ) Covered now: cn,pt(br),it,es_t,fr.,pl Update needed:de,nl
-Morph FAQ [English wiki]--Het grote emule topic deel 13 [Nederlands]
if you want to send a message i will tell you to open op a topic in the forum. Other forum lurkers might be helped as well.
#3
#4
Posted 29 January 2007 - 07:12 PM
One important thing to note is that global searches will end after approx 100 sources total have been found but I am guessing this is not what you are confused by. Being specific about search strings does make a difference but I don't fully understand the mechanism.
To learn exact details of how (server) searches work you need to investigate the server software (lugdunum) itself. Have a search around for links etc.
This post has been edited by niclights: 29 January 2007 - 07:14 PM
#5
Posted 29 January 2007 - 07:35 PM
niclights, on Jan 29 2007, 07:12 PM, said:
Thanks tho' slightly unstraightforward coz it's not open source - where might be the best place to ask?
#6
Posted 29 January 2007 - 10:30 PM
By default eserver tries to perform keyword searches.
If you search : "full example"
Then eserver takes the word that is least common in filenames.
"full" appears in 106527 filenames,
"example" appears in 1312 filenames.
So eserver will pick up the index on "example", then will try to match the substrings "full" and "example" in the filenames of files linked to this index.
So for this query it should match "fulle example.avi" but not "full examples.avi".
Is that clear?
If you want eserver to take a word as a substring and not as an index keyword, just append a '*' after the word :
"full example*"
In this case, eserver will choose the index on "full" and try to match "full" and "example".
It should match "full examples" or "examples this is full" but NOT "fulle example"
If you try "basic* example*" then eserver will try a full scan search (no keywords). But this kind of search is likely to be stopped automatically because it takes too much time (you can imagine searching in millions of filenames). On big servers, those searches are guaranteed to be aborted very soon (the time to perform a full scan on gigabytes of memory is just ... insane).
So *try* to always include at least one keyword in your search queries. The more keywords are given, the better the chance eserver has of choosing a short list and answering within microseconds.
Note : some very common words are not indexed because they are too frequent in filenames : the, of, mp3, avi, jpg, mpg, rar, zip, you, in, by ...
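The selection rule described in this post can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not eserver code: the class `ToyServer`, the helper `tokenize`, and the rule for splitting filenames into words are all made up for illustration. It only reproduces the three behaviours described above: pick the rarest keyword, filter its file list by substring, and fall back to a full scan when every term carries a '*'.

```python
import re
from collections import defaultdict

# Very common words that are not indexed (the partial list given in the post).
UNINDEXED = {"the", "of", "mp3", "avi", "jpg", "mpg", "rar", "zip", "you", "in", "by"}


def tokenize(filename):
    """Split a filename into lowercase words (assumed rule: any non-alphanumeric char separates)."""
    return [w for w in re.split(r"[^a-z0-9]+", filename.lower()) if w]


class ToyServer:
    """Hypothetical stand-in for the server's keyword index."""

    def __init__(self, filenames):
        self.filenames = list(filenames)
        self.index = defaultdict(set)  # keyword -> set of filenames containing it as a whole word
        for name in self.filenames:
            for word in tokenize(name):
                if word not in UNINDEXED:
                    self.index[word].add(name)

    def search(self, query):
        terms = query.lower().split()
        # A term carrying '*' is only a substring filter, never an index keyword.
        keywords = [t for t in terms if "*" not in t and t not in UNINDEXED]
        substrings = [t.strip("*") for t in terms]
        if keywords:
            # Use the keyword with the shortest file list.
            rarest = min(keywords, key=lambda k: len(self.index.get(k, ())))
            candidates = self.index.get(rarest, set())
        else:
            # No keyword at all: full scan over every known filename.
            candidates = self.filenames
        return sorted(n for n in candidates if all(s in n.lower() for s in substrings))


server = ToyServer([
    "fulle example.avi",
    "full examples.avi",
    "examples this is full.avi",
])
print(server.search("full example"))   # ['fulle example.avi']
print(server.search("full example*"))  # ['examples this is full.avi', 'full examples.avi']
```

With "full example" the index on "example" is chosen, so "full examples.avi" is never even looked at; with "full example*" the index on "full" is chosen, so "fulle example.avi" drops out instead.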
#7
Posted 30 January 2007 - 12:25 AM
lugdunummaster, on Jan 29 2007, 10:30 PM, said:
Yes thanks, this does explain how putting in too few words can shut down a search before it strikes gold, but in that case why wouldn't 300 unlooked for results show up in the process?
Also how can the opposite happen in some other searches: putting in the exact full name proves fruitless while a cut down version does the trick?
Quote
What about digits?
#8
Posted 30 January 2007 - 06:58 AM
Quote
Also how can the opposite happen in some other searches: putting in the exact full name proves fruitless while a cut down version does the trick?
Some servers may also have changed their config to send more or fewer results (300 is the default setting for a HighID client; a LowID client gets only 150).
Quote
You probably meant numbers.
Small numbers between 00 and 19 are not indexed either.
Also, there is a minimum keyword length, default 2, but tunable in the configuration file.
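Put together, the rules for what can serve as a keyword might look roughly like the sketch below. The function name `is_indexable` and the parameter `min_keyword_length` are hypothetical (the real configuration key is not named in this thread), and reading "between 00 and 19" as "one or two digits with a value up to 19" is an assumption.

```python
# Very common words that are not indexed (the partial list given earlier in the thread).
UNINDEXED_WORDS = {"the", "of", "mp3", "avi", "jpg", "mpg", "rar", "zip", "you", "in", "by"}


def is_indexable(word, min_keyword_length=2):
    """Return True if `word` could be used as an index keyword under the assumed rules."""
    word = word.lower()
    if len(word) < min_keyword_length:
        return False  # shorter than the minimum keyword length (default 2)
    if word in UNINDEXED_WORDS:
        return False  # too frequent in filenames
    if word.isdecimal() and len(word) <= 2 and int(word) <= 19:
        return False  # small numbers 00..19 (assumed interpretation)
    return True
```

Under these assumptions "07" and "mp3" cannot be keywords, while "2007" and "karate" can.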
#9
Posted 30 January 2007 - 02:56 PM
lugdunummaster, on Jan 29 2007, 10:30 PM, said:
From what you write there can't be more than one keyword - so do you refer to potential keywords - words entered without an * ?
lugdunummaster, on Jan 30 2007, 06:58 AM, said:
Do you mean the server's config file? What might it be retuned to? If this is common, is there some way to determine which servers do what? If you run a global search on a server with min.3 will you get only min.3 results communicated about files on other servers configured with min.2 or min.4?
#10
Posted 30 January 2007 - 03:27 PM
ule, on Jan 30 2007, 03:56 PM, said:
lugdunummaster, on Jan 29 2007, 10:30 PM, said:
From what you write there can't be more than one keyword - so do you refer to potential keywords - words entered without an * ?
No (as I understand it...), you CAN enter more than one keyword in the client. However, the rarest keyword must be an exact match. (You cannot see in the client which keyword that is; sometimes some common sense can help you.) So in effect you are always searching for one keyword, and then a filter is applied (like the local search.....).
Spell your rare, specific words correctly!
lugdunummaster (i would have guessed myself the indexes would have been merged like a relational database)
#11
Posted 30 January 2007 - 08:56 PM
Some things I still would like to know:
Global searches ending after approx 100 sources. Is this server-side, client or am I talking nonsense? (blaming Jestheonlyone if so!)
How are characters such as '-' and '_' treated (when immediately adjacent to alphanumeric chars)?
A pre-FR discussion: Would it be useful to know which word(s) were used by the server as keywords after performing a search? Is this something the server could easily notify the client about?
#12
Posted 30 January 2007 - 09:26 PM
niclights, on Jan 30 2007, 08:56 PM, said:
This would be nice, if they came out in bold perhaps, though it is pretty easy to recognize, once you are lucky enough to be wise to the foregoing knowhow, that the word that comes out only as is and never as part of another must always be the key.
#13
Posted 30 January 2007 - 10:20 PM
Quote
Each server may answer (or may not, since the query uses the UDP protocol, which is not guaranteed to work every time) and give up to 30 answers (default value of the eserver-side configuration parameter maxUDPSearchCount).
Once emule has enough answers it may cancel the loop (which sends one query to one server every second).
So 100 is not server side.
A 'local' server gives many more answers (because the TCP session permits zlib compression).
As a global search starts by sending a TCP search to the local server, emule may get many answers from its local server, thus stopping its loop without sending any UDP message to other servers.
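The client-side loop described here can be simulated without any networking. In this sketch everything except the value 30 for maxUDPSearchCount is an assumption: the function name `global_search`, the way servers are modelled as plain callables, and the cut-off of 100 (the approximate figure quoted earlier in this thread) are only illustrative.

```python
MAX_UDP_SEARCH_COUNT = 30  # default of the server-side parameter maxUDPSearchCount
ENOUGH_RESULTS = 100       # approximate client-side cut-off mentioned in this thread (assumed)


def global_search(local_results, other_servers, enough=ENOUGH_RESULTS):
    """Simulate a global search.

    local_results: answers from the local server (TCP, so potentially many).
    other_servers: callables standing in for UDP queries; each returns a list
                   of answers, or None when the UDP packet or reply is lost.
    Returns (all results, number of other servers actually queried).
    """
    results = list(local_results)  # the TCP search on the local server comes first
    queried = 0
    for ask in other_servers:
        if len(results) >= enough:
            break                  # enough answers: the client cancels its loop
        queried += 1
        answer = ask()             # a real client sends one UDP query per second
        if answer is None:
            continue               # UDP is not guaranteed: no answer from this server
        results.extend(answer[:MAX_UDP_SEARCH_COUNT])  # a server gives at most 30 answers
    return results, queried
```

If the local server alone already returns more than the cut-off, `queried` stays at 0, which is the case described above where no UDP message is sent at all.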
Quote
Those are separators, like space, dot, <, >, { } [ ] and other special chars...
However eserver handles some kinds of strings differently... that's tricky.
For example if you search "zlib 1.2.3" then eserver will match the full sequences "zlib" and "1.2.3", not "zlib", "1" and "2" and "3".
This helps finding given versions for example.
Quote
Quite tricky because if your search request is "word1 OR word2 OR (word3 AND substring) AND SIZE>500000", then eserver will probably use several keywords (ie using several lists instead of one list)
I suppose eserver could add an attribute to each result telling which keyword was used, but do we care, and is it worth the extra bandwidth?
#14
Posted 30 January 2007 - 10:42 PM
Quote
For example if you search "zlib 1.2.3" then eserver will match the full sequences "zlib" and "1.2.3", not "zlib", "1" and "2" and "3".
Quote
Well, that's why I asked!
#15
Posted 30 January 2007 - 10:57 PM
Quote
What I described is not related to the separator, but length of words (only the space is really special and always taken as a full separator)
i.e. if you search "next.karate" or "next karate" or "next-karate", eserver will perform the same job (i.e. finding
"next karate.avi" and "next.karate.avi" and "karate.next.avi").
But if you search "1.2.3.zlib", this won't find the same results as "zlib.1.2.3" or "zlib 1.2.3" or "zlib 1 2 3".
Maybe one day I will document the mechanism... because it's really tricky.
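For the simple "next karate" case, the separator behaviour can be illustrated with a small sketch. It assumes that every non-alphanumeric character splits words and that word order does not matter; the helpers `words` and `matches` are hypothetical. It deliberately does not model the special handling of strings such as "1.2.3", since that mechanism is not documented in this thread.

```python
import re


def words(text):
    """Lowercase words of `text`, splitting on any non-alphanumeric character (assumed rule)."""
    return {w for w in re.split(r"[^a-z0-9]+", text.lower()) if w}


def matches(query, filename):
    """True if every word of the query appears as a word of the filename, in any order."""
    return words(query) <= words(filename)


queries = ["next.karate", "next karate", "next-karate"]
files = ["next karate.avi", "next.karate.avi", "karate.next.avi"]
# All three queries reduce to the same word set, so each matches all three files.
print(all(matches(q, f) for q in queries for f in files))
```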
#16
Posted 31 January 2007 - 12:56 AM
I could go on to ask about the length of these words, but it is not important. It is clear how the mechanism is intended to work now but I can see why it would be difficult to document!
It does make all the difference when you are trying to find either the extremely rare or when you are trying to find the file that is swamped under similar, but much more popular, files despite judicious use of filters & operators!
One last question. Search strings within quotation marks. Would I be correct in thinking that an attempt to force the matching of a whole string ie. ' "a-rare_file" ' will search for exactly that, including the hyphen and underscore? Also, in a similar example but without the special chars ie. ' "a rare file" ', would this be treated as a full scan search and fall under the limitations you described earlier?
Oh, while you're here and we're talking server search... I remembered this: http://forum.emule-p...howtopic=110707 about NOT operators on filehash. I'm just curious. The uses for this are very limited and rare, but not worthless. Was this not implemented because of the overhead (as Kry points out) or because it could be exploited and cause huge load or just for no particular reason?!
Many thanks!
#17
Posted 31 January 2007 - 02:34 PM
Quote
If you put quotation mark around a string, like :
"zlib 1.2.3" AND zip
then eserver will :
- Attempt to extract at least one keyword from the whole request (and choose the 'best' one if several are found), in order to be able to use its internal indexes. Said keyword can be found inside the quoted string, or not...
- Examine all filenames attached to the selected keyword(s) and try to match the full string "zlib 1.2.3" (without quotation marks of course) somewhere in the filename. So it will match "Zlib 1.2.3.4.zip" but not "zlib-1.2.3.zip" (different separator) and not "zlib  1.2.3.zip" (several spaces).
Note : Case is not important in searches.
So the general answer to your question is "Yes", minus the case
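The quoted-string step amounts to a case-insensitive exact substring test applied to the filenames that the keyword index has already selected. A minimal sketch (the function name `phrase_matches` is made up, and the keyword selection step is left out):

```python
def phrase_matches(phrase, filename):
    """Case-insensitive test: does `filename` contain `phrase` exactly, separators included?"""
    return phrase.lower() in filename.lower()


print(phrase_matches("zlib 1.2.3", "Zlib 1.2.3.4.zip"))  # True: case is ignored
print(phrase_matches("zlib 1.2.3", "zlib-1.2.3.zip"))    # False: different separator
print(phrase_matches("zlib 1.2.3", "zlib  1.2.3.zip"))   # False: two spaces instead of one
```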
Note : eserver handles a search cursor per client, so that if the index list(s) is (are) not exhausted, a flag is sent to emule saying 'More results are available, please let the user press the "More" button if he wants'. If the user presses the 'More' button, then emule sends back a special small request saying 'please send me more results'.
Eserver has to store some state in order to be able to restart the search logic at the point where it was put on hold.
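The per-client cursor can be pictured with the toy class below. The names `SearchCursors`, `search` and `more` are hypothetical, and for simplicity the sketch stores the whole list of matches, whereas a real server would more likely remember only its position in the index list. The default page size of 300 is the HighID default mentioned earlier in this thread.

```python
class SearchCursors:
    """Hypothetical per-client search state, so that 'More' can resume a search."""

    def __init__(self, page_size=300):
        self.page_size = page_size
        self._cursors = {}  # client id -> (list of matches, next position)

    def search(self, client_id, matches):
        """Start a new search for this client and return the first page."""
        self._cursors[client_id] = (list(matches), 0)
        return self.more(client_id)

    def more(self, client_id):
        """Return (next page, flag telling whether more results are available)."""
        matches, pos = self._cursors[client_id]
        page = matches[pos:pos + self.page_size]
        pos += len(page)
        has_more = pos < len(matches)
        if has_more:
            self._cursors[client_id] = (matches, pos)  # keep state for the next 'More'
        else:
            del self._cursors[client_id]               # list exhausted: drop the state
        return page, has_more
```

The returned flag plays the role of the 'More results are available' flag sent to the client.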
About your question on NOT operators : They are handled by eserver, like many other operators (AND, ANDNOT, OR, XOR). A search query is a tree that can be quite complex. Each client chooses to implement a subset of the possible trees (mainly a look-and-feel problem for users).
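To make the idea of a query tree concrete, here is a small evaluator for the operators named above. The nested-tuple encoding, the node names "WORD" and "SIZE>", and the function `evaluate` are invented for this sketch; they are not the actual wire format or eserver's implementation, and word nodes are treated as plain substring tests.

```python
def evaluate(node, filename, size):
    """Evaluate a query tree (nested tuples) against one file's name and size."""
    op = node[0]
    if op == "WORD":   # leaf: case-insensitive substring test
        return node[1].lower() in filename.lower()
    if op == "SIZE>":  # leaf: file must be larger than the given size
        return size > node[1]
    left = evaluate(node[1], filename, size)
    right = evaluate(node[2], filename, size)
    if op == "AND":
        return left and right
    if op == "OR":
        return left or right
    if op == "ANDNOT":
        return left and not right
    if op == "XOR":
        return left != right
    raise ValueError("unknown operator: %r" % (op,))


# (zlib ANDNOT beta) AND SIZE>500000
tree = ("AND",
        ("ANDNOT", ("WORD", "zlib"), ("WORD", "beta")),
        ("SIZE>", 500000))
print(evaluate(tree, "zlib 1.2.3.zip", 600000))  # True
print(evaluate(tree, "zlib beta.zip", 600000))   # False
```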
#18
#19
Posted 31 January 2007 - 07:23 PM
While you're at it:
What's the difference between *example and example* ?
I remember that you once stated that *example does a substring search. But I get different results with the two queries.
Thank you for your work...
#20
Posted 31 January 2007 - 09:00 PM
Quote
What's the difference between *example and example* ?
I remember that you once stated that *example does a substring search. But I get different results with the two queries.
Interesting question
Yes, the * can be at the end or the beginning of a word to ask for a substring search. So there is no difference.
But as your requests have no keywords, eserver performs a 'full scan search' on every filename it knows.
As it is a potentially long request, eserver may decide to cancel it. The cancellation may happen if the number of outstanding requests reaches a given threshold.
As eserver is multi-threaded, a thread servicing a 'full scan search' may notice the threshold is reached and stop, providing the accumulated results in the answer packet.
In this case, your emule displays fewer results but the 'More' button should be available : eserver stopped the search but stored the cursor so that the user may try to get more results. As this really depends on the load of the eserver, you may get more results from a small server (few clients but available CPU cycles).
The key is : eserver may decide to not fully handle a 'full scan search request'. All other requests are handled without looking at the load level.
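A cancellable full scan of this kind could be sketched as follows. The function `full_scan`, its parameters, and the way server load is passed in as a callable are assumptions made for the example; the only behaviour taken from the post is that the scan stops when the number of outstanding requests reaches a threshold, returns what it has accumulated, and keeps a cursor so that the search can be resumed.

```python
def full_scan(filenames, substrings, start, outstanding_requests, threshold):
    """Scan filenames from index `start`, stopping early when the server is too busy.

    outstanding_requests: callable returning the current number of pending requests.
    Returns (matches found so far, cursor), where cursor is the index to resume
    from, or None when the scan ran to completion.
    """
    needles = [s.lower() for s in substrings]
    found = []
    pos = start
    while pos < len(filenames):
        if outstanding_requests() >= threshold:
            return found, pos  # cancelled: partial results plus a cursor for 'More'
        name = filenames[pos]
        if all(n in name.lower() for n in needles):
            found.append(name)
        pos += 1
    return found, None         # every filename was examined


files = ["example one.avi", "other.avi", "my examples.zip", "sample.rar", "example two.mp3"]
load = iter([0, 0, 5])  # simulated load: the third check hits the threshold
print(full_scan(files, ["example"], 0, lambda: next(load), 5))
# (['example one.avi'], 2)
print(full_scan(files, ["example"], 2, lambda: 0, 5))
# (['my examples.zip', 'example two.mp3'], None)
```

The second call plays the role of the 'More' button: it resumes from the stored cursor on a server that is no longer overloaded.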