Light On Searching


#1 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 29 January 2007 - 03:02 PM

I can't fathom how Searches work.

Sometimes I can't find a file unless I put in its whole name, while putting in an exact but incomplete version of it only turns up a few dozen unrelated results. For other items, putting in the whole exact title reveals nothing, but putting in only half of it comes up trumps. How come?

DARK ENERGY = ANTI-MATTER X INERTIA OF ETERNAL NIGHT SQUARED

#2 leuk_he

MorphXT team.

Group: Members
Posts: 5975
Joined: 11-August 04

Posted 29 January 2007 - 03:11 PM

2 questions back...

Have you fixed the fake server problem already? (Since you are not a complete newbie, I assume yes.)

And you will have to specify what kind of search you are talking about.
-Kad -> the first word determines where it will look.
-Server -> only the local server is searched.
-Global -> all servers are searched until .....

Download the MorphXT emule mod here: eMule Morph mod

Trouble connecting to a server? Use kad and /or refresh your server list
Strange search results? Check for fake servers! Or download morph, enable obfuscated server required, and far less fake server seen.

Looking for morphXT translators. If you want to translate the morph strings please come here (you only need to be able to write, no coding required. ) Covered now: cn,pt(br),it,es_t,fr.,pl Update needed:de,nl
-Morph FAQ [English wiki]--Het grote emule topic deel 13 [Nederlands]

If you want to send me a message, I will tell you to open up a topic in the forum. Other forum lurkers might be helped as well.

#3 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 29 January 2007 - 04:06 PM

leuk_he, on Jan 29 2007, 03:11 PM, said:

2 questions back...

Yes and Global - haven't explored this issue properly with other methods yet

You sound like you suppose this phenomenon is personal to me....??


#4 niclights

lost in space

Group: Members
Posts: 10288
Joined: 01-November 04

Posted 29 January 2007 - 07:12 PM

No. It is not unique to you, nor is it because of fake servers, although that is always worth checking first.

One important thing to note is that global searches will end after approx 100 sources total have been found but I am guessing this is not what you are confused by. Being specific about search strings does make a difference but I don't fully understand the mechanism.

To learn exact details of how (server) searches work you need to investigate the server software (lugdunum) itself. Have a search around for links etc.

This post has been edited by niclights: 29 January 2007 - 07:14 PM

#5 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 29 January 2007 - 07:35 PM

niclights, on Jan 29 2007, 07:12 PM, said:

To learn exact details of how (server) searches work you need to investigate the server software (lugdunum) itself. Have a search around for links etc.

Thanks, tho' that's slightly unstraightforward coz it's not open source

- where might be the best place to ask?


#6 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 29 January 2007 - 10:30 PM

Hi

By default eserver tries to perform keyword searches.

If you search : "full example"

Then eserver takes the word that is less common in filenames.
"full" appears in 106527 filenames,
"example" appears in 1312 filenames.

So eserver will pick up the index on "example", then will try to match the substrings "full" and "example" in the filenames of the files linked to this index.

So for this query it should match "fulle example.avi" but not "full examples.avi"

Is that clear ?
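
To make the selection concrete, here is a minimal Python sketch of the behaviour described above. It is not eserver code (eserver is closed source): the function names, the way filenames are split into words, and the toy file list are all assumptions made for illustration.

```python
import re

def words(filename):
    """Split a filename into lowercase words (assumed tokenization)."""
    return [w for w in re.split(r"[^0-9a-z]+", filename.lower()) if w]

def build_index(filenames):
    """Map each whole word to the set of filenames that contain it."""
    index = {}
    for name in filenames:
        for w in words(name):
            index.setdefault(w, set()).add(name)
    return index

def keyword_search(index, query):
    terms = query.lower().split()
    # Only words that exist in the index can serve as the keyword.
    candidates = [t for t in terms if t in index]
    if not candidates:
        return []
    # Pick the word that appears in the fewest filenames.
    key = min(candidates, key=lambda t: len(index[t]))
    # Every term must then appear as a substring of the filename.
    return sorted(n for n in index[key] if all(t in n.lower() for t in terms))

files = ["fulle example.avi", "full examples.avi", "full movie.avi", "full story.avi"]
print(keyword_search(build_index(files), "full example"))
```

In this toy list "example" is the rarer word, so only the files indexed under "example" are examined, and "full examples.avi" is never looked at even though both words occur in it as substrings.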

If you want eserver to take a string as a substring and not as an index keyword, just append a '*' after the word :

"full example*"

In this case, eserver will choose the index on "full" and try to match "full" and "example".
It should match "full examples" or "examples this is full" but NOT "fulle example"
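
The effect of the '*' can be sketched the same way. This is only an illustration under assumptions (hypothetical `search` function, assumed word splitting, toy file list), not the real eserver logic: a starred word is never used as the index keyword, it is only matched as a substring.

```python
import re

def words(filename):
    """Split a filename into lowercase words (assumed tokenization)."""
    return [w for w in re.split(r"[^0-9a-z]+", filename.lower()) if w]

def search(filenames, query):
    index = {}
    for name in filenames:
        for w in words(name):
            index.setdefault(w, set()).add(name)
    terms = query.lower().split()
    # All terms are matched as substrings, with any '*' removed.
    substrings = [t.strip("*") for t in terms]
    # Only words without a '*' may be used as the index keyword.
    keywords = [t for t in terms if "*" not in t and t in index]
    if not keywords:
        raise ValueError("no keyword: this would need a full scan")
    key = min(keywords, key=lambda t: len(index[t]))
    return sorted(n for n in index[key] if all(s in n.lower() for s in substrings))

files = ["full examples.avi", "examples this is full.avi", "fulle example.avi"]
print(search(files, "full example*"))
```

Here "fulle example.avi" is not returned, because it is not listed under the exact word "full". A query such as "basic* example*" leaves no keyword at all, which the sketch reports as an error instead of scanning everything.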

If you try "basic* example*" then eserver will try a full scan search (no keywords). But these kinds of searches are likely to be stopped automatically because they take too much time (you can imagine searching in millions of filenames). On big servers, those searches are guaranteed to be aborted very soon (the time to perform a full scan on gigabytes of memory is just ... insane).

So *try* to always include at least one keyword in your search queries. The more keywords are given, the better the chance that eserver can choose a short list and answer within microseconds.

Note : some very common words are not indexed because too frequent in filenames : the, of, mp3, avi, jpg, mpg, rar, zip, you, in, by ...

#7 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 30 January 2007 - 12:25 AM

lugdunummaster, on Jan 29 2007, 10:30 PM, said:

Is that clear ?

Yes thanks, this does explain how putting in too few words can shut down a search before it strikes gold, but in that case why wouldn't 300 unlooked for results show up in the process?
Also how can the opposite happen in some other searches: putting in the exact full name proves fruitless while a cut down version does the trick?

Quote

Note : some very common words are not indexed because too frequent in filenames : the, of, mp3, avi, jpg, mpg, rar, zip, you, in, by ...

What about digits?


#8 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 30 January 2007 - 06:58 AM

Quote

It really depends on the client too. Global searches can be canceled by emule if it thinks it has got enough results.
Some servers may have changed their config to send more or fewer results too (300 is the default setting for a HighID client; a LowID client gets only 150).

Quote

What about digits?

You probably meant numbers.

Small numbers between 00 and 19 are not indexed either.

Also, there is a minimum keyword length, default 2, but tunable in the configuration file.
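
Putting the indexing rules from this thread together, a rough check for "can this word be used as a keyword?" might look like the sketch below. The stop-word set is only the partial list quoted earlier in the thread (the original ends with "..."), the reading of "00 to 19" as two-digit strings is an assumption, and `is_indexed` is a made-up name.

```python
import re

# Partial list only: the thread's own list ends with "...".
STOP_WORDS = {"the", "of", "mp3", "avi", "jpg", "mpg", "rar", "zip", "you", "in", "by"}

def is_indexed(word, min_length=2):
    """Guess whether a word would be indexed (min_length is server-tunable)."""
    w = word.lower()
    if len(w) < min_length:
        return False              # shorter than the minimum keyword length
    if w in STOP_WORDS:
        return False              # too frequent in filenames
    if re.fullmatch(r"[01][0-9]", w):
        return False              # small numbers 00..19 (assumed two-digit form)
    return True

for w in ["example", "avi", "07", "20", "a"]:
    print(w, is_indexed(w))
```

A server configured with a larger minimum length would simply be modelled as `is_indexed(word, min_length=3)`.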

#9 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 30 January 2007 - 02:56 PM

Thanks for all that, here's a few more loose ends:

lugdunummaster, on Jan 29 2007, 10:30 PM, said:

..................So *try* to always include at least one keyword in your search queries.

From what you write there can't be more than one keyword - so do you refer to potential keywords - words entered without an * ?

lugdunummaster, on Jan 30 2007, 06:58 AM, said:

Also, there is a minimum keyword length, default 2, but tunable in configuration file.

Do you mean the server's config file? What might it be retuned to? If this is common, is there some way to determine which servers do what? If you run a global search on a server with min.3, will you get only min.3 results communicated about files on other servers configured with min.2 or min.4?


#10 leuk_he

MorphXT team.

Group: Members
Posts: 5975
Joined: 11-August 04

Posted 30 January 2007 - 03:27 PM

ule, on Jan 30 2007, 03:56 PM, said:

Thanks for all that, here's a few more loose ends:

lugdunummaster, on Jan 29 2007, 10:30 PM, said:

..................So *try* to always include at least one keyword in your search queries.

From what you write there can't be more than one keyword - so do you refer to potential keywords - words entered without an * ?

No (as I understand it... :angelnot: ), you CAN enter more than one keyword in the client. However, the rarest keyword must be an exact match (you cannot see in the client which keyword that is; sometimes some common sense can help you). So in effect you are always searching for one keyword, and then a filter is applied (like the local search.....).

Spell your rare, specific words correctly!

:worthy: lugdunummaster (I would have guessed myself that the indexes would have been merged, as in a relational database)


#11 niclights

lost in space

Group: Members
Posts: 10288
Joined: 01-November 04

Posted 30 January 2007 - 08:56 PM

@Lug: Really useful stuff! Explains some of the things I didn't understand though I am still digesting it.

Some things I still would like to know:

Global searches ending after approx 100 sources. Is this server-side, client or am I talking nonsense? (blaming Jestheonlyone if so!)

How are characters such as '-' and '_' treated (when immediately adjacent to alphanumeric chars)?

A pre-FR discussion: Would it be useful to know which word(s) were used by the server as keywords after performing a search? Is this something the server could easily notify the client about?

#12 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 30 January 2007 - 09:26 PM

niclights, on Jan 30 2007, 08:56 PM, said:

A pre-FR discussion: Would it be useful to know which word(s) were used by the server as keywords after performing a search? Is this something the server could easily notify the client about?

This would be nice, if they came out in bold perhaps, though once you are lucky enough to be wise to the foregoing know-how it is pretty easy to recognize: the word that only ever comes out as is, and never as part of another word, must be the key.


#13 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 30 January 2007 - 10:20 PM

Quote

Global searches ending after approx 100 sources. Is this server-side, client or am I talking nonsense? (blaming Jestheonlyone if so!)

When emule performs a global search, it sends the query to each known server.
Each server may answer (or may not, since the query uses the UDP protocol, which is not guaranteed to work every time) and give up to 30 answers (the default value of the eserver-side configuration parameter maxUDPSearchCount).
Once emule has enough answers it may cancel the loop (which sends one query to one server every second).
So 100 is not server side.
A 'local' server gives many more answers (because the TCP session permits zlib compression).
As a global search starts by sending a TCP search to the local server, emule may get many answers from its local server, thus stopping its loop and not sending any UDP message to other servers.
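
The loop described above can be modelled in a few lines. This is a simulation under assumptions, not emule code: `global_search` and its parameters are hypothetical, the "enough" threshold of 100 is just the figure mentioned in this thread, a lost UDP packet is modelled as `None`, and the one-second pause between servers is left out.

```python
def global_search(local_results, udp_servers, enough=100, max_udp_answers=30):
    """Return (results, number_of_udp_servers_queried).

    local_results: answers from the TCP search on the local server.
    udp_servers:   callables returning a list of answers, or None if
                   the UDP query or its reply was lost.
    """
    results = list(local_results)
    queried = 0
    for ask in udp_servers:
        if len(results) >= enough:
            break                     # client thinks it has enough answers
        queried += 1
        answer = ask()
        if answer:
            # Each server gives at most maxUDPSearchCount answers (default 30).
            results.extend(answer[:max_udp_answers])
    return results, queried

# A local server that already returns plenty: no UDP query is sent at all.
print(global_search(["hit"] * 150, [lambda: ["x"] * 50])[1])
```

With a generous local server the loop never starts, which is why a global search can end up showing results from the local server only.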

Quote

How are characters such as '-' and '_' treated (when immediately adjacent to alphanumeric chars)?

Those are separators, like space, dot, <, >, { } [ ] and other special chars...

However, eserver handles some kinds of strings differently... that's tricky.

For example if you search "zlib 1.2.3" then eserver will match the full sequences "zlib" and "1.2.3", not "zlib", "1" and "2" and "3".
This helps finding given versions for example.

Quote

A pre-FR discussion: Would it be useful to know which word(s) were used by the server as keywords after performing a search? Is this something the server could easily notify the client about?

Quite tricky, because if your search request is "word1 OR word2 OR (word3 AND substring) AND SIZE>500000", then eserver will probably use several keywords (i.e. using several lists instead of one list).
I suppose eserver could add an attribute to each result telling which keyword was used, but do we care, and is it worth the extra bandwidth ?

#14 niclights

lost in space

Group: Members
Posts: 10288
Joined: 01-November 04

Posted 30 January 2007 - 10:42 PM

Thanks. That all makes sense!

Quote

However, eserver handles some kinds of strings differently... that's tricky.

For example if you search "zlib 1.2.3" then eserver will match the full sequences "zlib" and "1.2.3", not "zlib", "1" and "2" and "3".

I assume there is a fixed set of strings for when eserver chooses to include these separators in the search?

Quote

I suppose eserver could add an attribute to each result telling which keyword was used, but do we care, and is it worth the extra bandwidth ?

Well, that's why I asked!

#15 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 30 January 2007 - 10:57 PM

Quote

I assume there is fixed set of strings for when eserver chooses to include these separators in the search?

What I described is not related to the separator, but to the length of the words (only the space is really special and is always taken as a full separator).

I.e. if you search "next.karate" or "next karate" or "next-karate", eserver will perform the same job (i.e. finding "next karate.avi" and "next.karate.avi" and "karate.next.avi").

But if you search "1.2.3.zlib", this won't find the same results as "zlib.1.2.3" or "zlib 1.2.3" or "zlib 1 2 3".

Maybe one day I will document the mechanism... because it's really tricky.
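
Only the simple case above (ordinary words with different separators) can be sketched safely. The snippet below is an assumption-laden illustration: the separator set is taken from the characters named in this thread, `normalize` and `matches` are hypothetical names, and the special handling of short sequences such as "1.2.3" is deliberately NOT modelled, since the thread does not explain it.

```python
import re

# Separators named in the thread: space, dot, '-', '_', '<', '>', '{', '}', '[', ']'.
SEPARATORS = r"[\s.\-_<>{}\[\]]+"

def normalize(text):
    """Reduce a query or filename to its set of lowercase words."""
    return frozenset(w for w in re.split(SEPARATORS, text.lower()) if w)

def matches(query, filename):
    """Simplified test: every query word occurs as a word of the filename."""
    return normalize(query) <= normalize(filename)

queries = ["next.karate", "next karate", "next-karate"]
names = ["next karate.avi", "next.karate.avi", "karate.next.avi"]
print(all(matches(q, n) for q in queries for n in names))
```

All three spellings of the query reduce to the same two words, so they find the same three files regardless of word order or separator.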

#16 niclights

lost in space

Group: Members
Posts: 10288
Joined: 01-November 04

Posted 31 January 2007 - 12:56 AM

Ok. I understand. All very useful information

I could go on to ask about the length of these words, but it is not important. It is clear now how the mechanism is intended to work, and I can see why it would be difficult to document!

It does make all the difference when you are trying to find either the extremely rare or when you are trying to find the file that is swamped under similar, but much more popular, files despite judicious use of filters & operators!

One last question: search strings within quotation marks. Would I be correct in thinking that an attempt to force the matching of a whole string, i.e. ' "a-rare_file" ', will search for exactly that, including the hyphen and underscore? Also, in a similar example but without the special chars, i.e. ' "a rare file" ', would this be treated as a full scan search and fall under the limitations you described earlier?

Oh, while you're here and we're talking server search... I remembered this: http://forum.emule-p...howtopic=110707 about NOT operators on filehash. I'm just curious. The uses for this are very limited and rare, but not worthless. Was this not implemented because of the overhead (as Kry points out) or because it could be exploited and cause huge load or just for no particular reason?!

Many thanks!

#17 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 31 January 2007 - 02:34 PM

Quote

One last question: search strings within quotation marks. Would I be correct in thinking that an attempt to force the matching of a whole string, i.e. ' "a-rare_file" ', will search for exactly that, including the hyphen and underscore? Also, in a similar example but without the special chars, i.e. ' "a rare file" ', would this be treated as a full scan search and fall under the limitations you described earlier?

If you put quotation marks around a string, like :
"zlib 1.2.3" AND zip

then eserver will :

- Attempt to extract at least one keyword from the whole request (and choose the 'best' one if several are found), in order to be able to use its internal indexes. Said keyword may or may not be found inside the quoted string...

- Examine all filenames attached to the selected keywords and try to match the full string "zlib 1.2.3" (without the quotation marks, of course) somewhere in the filename. So it will match "Zlib 1.2.3.4.zip" but not "zlib-1.2.3.zip" (different separator) and not "zlib  1.2.3.zip" (several spaces).

Note : Case is not important in searches.

So the general answer to your question is "Yes", minus the case
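
The second step, matching the quoted string itself, amounts to a case-insensitive exact substring test. A minimal sketch (the `quoted_match` name and the candidate list are made up; choosing the candidates through a keyword index is left out):

```python
def quoted_match(phrase, filenames):
    """Keep filenames containing the phrase exactly, ignoring case only."""
    needle = phrase.lower()
    return [name for name in filenames if needle in name.lower()]

candidates = [
    "Zlib 1.2.3.4.zip",   # contains "zlib 1.2.3" once case is ignored
    "zlib-1.2.3.zip",     # different separator
    "zlib  1.2.3.zip",    # two spaces instead of one
]
print(quoted_match("zlib 1.2.3", candidates))
```

Nothing is normalized apart from case, so a different separator or an extra space is enough to make the quoted string miss.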

Note : eserver keeps a search cursor per client, so that if the index list(s) is (are) not exhausted, a flag is sent to emule saying 'More results are available, please let the user press the "More" button if he wants'. If the user presses the 'More' button, then emule sends back a special small request saying 'please send me more results'.
Eserver has to store some state in order to be able to restart the search logic at the point where it was put on hold.
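
The cursor idea can be illustrated with a small class that remembers where it stopped. This is a hedged sketch, not the eserver implementation: `SearchCursor` and its methods are hypothetical, and the page size of 300 is simply the HighID default mentioned earlier in the thread.

```python
class SearchCursor:
    """Per-client state that lets a search resume where it stopped."""

    _END = object()  # sentinel marking an exhausted candidate list

    def __init__(self, candidates, predicate, page_size=300):
        self._matches = (c for c in candidates if predicate(c))
        self._page_size = page_size
        self._lookahead = next(self._matches, self._END)

    def next_page(self):
        """Return (results, more_available) for one answer packet."""
        page = []
        while self._lookahead is not self._END and len(page) < self._page_size:
            page.append(self._lookahead)
            self._lookahead = next(self._matches, self._END)
        return page, self._lookahead is not self._END

names = [f"file{i}.zip" for i in range(5)]
cursor = SearchCursor(names, lambda n: "file" in n, page_size=2)
print(cursor.next_page())   # first answer
print(cursor.next_page())   # user pressed 'More'
print(cursor.next_page())   # last page, no more results
```

The second element of each answer plays the role of the 'More results are available' flag.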

About your question on NOT operators : they are handled by eserver, like many other operators (AND, ANDNOT, OR, XOR). A search query is a tree that can be quite complex. Each client chooses to implement part of the possible trees (mainly a look-and-feel problem for users).
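
To picture a query as a tree, here is a small evaluator for the four operators named above. The class names, the dictionary shape of a file entry, and the size test are assumptions for illustration; nothing here reflects the actual wire format or eserver's internals.

```python
from dataclasses import dataclass

@dataclass
class Term:
    text: str
    def matches(self, f):
        return self.text.lower() in f["name"].lower()

@dataclass
class MinSize:
    size: int
    def matches(self, f):
        return f["size"] > self.size

@dataclass
class Op:
    kind: str        # "AND", "OR", "ANDNOT" or "XOR"
    left: object
    right: object
    def matches(self, f):
        a, b = self.left.matches(f), self.right.matches(f)
        if self.kind == "AND":
            return a and b
        if self.kind == "OR":
            return a or b
        if self.kind == "ANDNOT":
            return a and not b
        if self.kind == "XOR":
            return a != b
        raise ValueError(f"unknown operator: {self.kind}")

# (word1 OR word2) AND SIZE>500000, written out as an explicit tree.
tree = Op("AND", Op("OR", Term("word1"), Term("word2")), MinSize(500000))
print(tree.matches({"name": "word1 clip.avi", "size": 700000}))
```

A client's search box only needs to build the subset of such trees that its interface can express; the server evaluates whatever tree it receives.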

#18 ule

Splendid Member

Group: Members
Posts: 113
Joined: 22-September 06

Posted 31 January 2007 - 04:01 PM

Thanks - Truly Enlightening - Maybe this should be made into some kind of Sticky while we await the great documentation


#19 buzz

Golden eMule

Group: Members
Posts: 860
Joined: 25-December 02

Posted 31 January 2007 - 07:23 PM

@lug:

While you're at it:
What's the difference between *example and example* ?
I remember that you once stated that *example does a substring search. But I get different results with the two queries.

Thank you for your work...

#20 lugdunummaster

Golden eMule

Group: Member_D
Posts: 1040
Joined: 19-September 02

Posted 31 January 2007 - 09:00 PM

Quote

While you're at it:
What's the difference between *example and example* ?
I remember that you once stated that *example does a substring search. But I get different results with the two queries.

Interesting question :flowers:

Yes, the * can be at the end or the beginning of a word to ask for a substring search. So there is no difference.

But as your requests have no keywords, eserver performs a 'full scan search' on every filename it knows.

As it is a potentially long request, eserver may decide to cancel it. The cancellation may happen if the number of outstanding requests reaches a given threshold.
As eserver is multi-threaded, a thread servicing a 'full scan search' may notice that the threshold is reached and stop, providing the accumulated results in the answer packet.

In this case, your emule displays fewer results but the 'More' button should be available : eserver did stop the search but stored the cursor, so that the user may try to get more results. As this stuff really depends on the load of the eserver, you may get more results from a small server (few clients but available CPU cycles).

The key is : eserver may decide not to fully handle a 'full scan search' request. All other requests are handled without looking at the load level.
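
The early stop can be modelled like this. It is a single-threaded sketch under assumptions: `full_scan`, the `outstanding_requests` callback and the threshold value are all hypothetical stand-ins for whatever eserver really measures.

```python
def full_scan(filenames, substring, start=0,
              outstanding_requests=lambda: 0, threshold=10):
    """Scan every filename for a substring, giving up when the server is busy.

    Returns (results, cursor): cursor is None when the scan finished,
    otherwise the position from which a 'More' request can resume.
    """
    needle = substring.lower()
    results = []
    for i in range(start, len(filenames)):
        if outstanding_requests() >= threshold:
            return results, i          # stop early, keep partial results
        if needle in filenames[i].lower():
            results.append(filenames[i])
    return results, None

names = ["example1.avi", "other.avi", "my example.avi", "example3.avi"]
load = iter([0, 0, 99])                # server becomes busy at the third file
partial, cursor = full_scan(names, "example", outstanding_requests=lambda: next(load))
print(partial, cursor)
print(full_scan(names, "example", start=cursor))   # the 'More' request
```

The same query therefore returns different amounts on different servers, or on the same server at different times, depending purely on the load at that moment.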


Official eMule-Board: Light On Searching - Official eMule-Board