Language Option In Search
#1
Posted 27 January 2005 - 12:12 AM
#2
Posted 27 January 2005 - 12:52 PM
#3
Posted 27 January 2005 - 01:46 PM
#4
Posted 27 January 2005 - 05:57 PM
For that reason some uploaders put [eng], [esp], etc. in their file names.
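A minimal sketch of how a client could pick such bracketed tags out of a file name. The tag list and the function name are assumptions made up for illustration; this is not something eMule does today.

```python
import re

# Hypothetical set of tags uploaders might use; not an official list.
KNOWN_TAGS = {"eng", "esp", "ger", "fre", "jpn"}

def language_tags_from_name(file_name: str) -> list:
    """Return known bracketed tags such as [eng] found in a file name, in order."""
    found = re.findall(r"\[([A-Za-z]{2,3})\]", file_name)
    return [tag.lower() for tag in found if tag.lower() in KNOWN_TAGS]
```

Since anyone can rename a file, tags found this way are only a hint.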
#5
Posted 08 March 2005 - 03:41 PM
This is going to be a feasibility study for the time being. If all goes well, technically as well as content-wise, it will become a feature request in the end.
There are many videos in the ed2k network that are available in a lot of different languages. So eMule users are facing the challenge of how to find a file in the most appropriate language without having to download each and every file of related content, which slows down the network with unnecessary transfers.
The only information about languages in videos (be that audio streams or subtitles) currently available is the file name. Now we all know that file names can be changed arbitrarily; what's more, many fansubbers don't bother to specify the language in their file names. They would rather put their own fansub group codes there, convinced that users of their language will know these codes from web sites or forums covering the content. But what about users of other languages? For them a fansub group code is almost worthless, and if there is no version of the video that is obviously in the user's primary language, the best he/she can do is download all insufficiently labelled variants and check for themselves. Obviously this is a waste of bandwidth - I had to do this myself a lot of times, unfortunately.
eMule's internal search provides a number of parameters, such as the video codec. But it does not allow restricting a search to one language (or a list of languages), which would be the solution to the problem above.
This posting is an attempt to find out what would have to be done to provide such an additional search parameter.
I don't know how the codec information (which is already available as a search parameter) is derived from the video containers. But I'm positive that it must be some ed2k client analyzing the file and extracting this kind of information, as the lugdunum servers don't have access to the file content - they only receive the information supplied by the client when publishing a share. (We'll need this fact later, see below.)
So there must already be some code to parse video containers in eMule or related client software. Let's start from here, because...
...last week I happened to stumble over the project coordinator of the Matroska project (Christian) in some German forum, and we discussed this issue there. What he told me was this:
1. The Matroska video container provides language tags for all kinds of streams, be that video, audio or subtitle streams.
2. The Matroska project planned to provide an eMule "plugin" to do exactly what I want but officially it must be done by the eMule coders, not by the Matroska project.
3. The Matroska source code is open source (some GNU-type license), written in C++, and compiles with MSVC, GCC and MinGW. So it should be fairly compatible with eMule, aMule etc.
4. There is some tool named mkvinfo for extracting information of this kind from a Matroska container. This is a commandline tool but it might serve as an example program how to access said language tags in a Matroska container.
5. Currently there is no API specifically for the requirements of eMule, most likely because there is no specification for any such requirement. One result of this thread might be writing down an API specification and then asking the Matroska project to provide some DLL for this API, which would limit the amount of Matroska specific code in eMule to the absolute minimum. This way the Matroska code could be embedded into eMule in a similar way as the gzip compression code is currently embedded by importing the "zlib" library.
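To make point 5 more concrete, here is a rough sketch of what such an API could look like from the client's side. Everything in it (`StreamLanguage`, `register_reader`, `read_languages`) is a hypothetical name invented for this posting; nothing like it exists in eMule or in the Matroska code as far as I know. The idea is only that each container format plugs in one reader function and the client never sees format-specific code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

@dataclass(frozen=True)
class StreamLanguage:
    kind: str      # "video", "audio" or "subtitle"
    language: str  # language code exactly as stored in the container

# A reader takes a file path and returns the language tags it found.
Reader = Callable[[Path], List[StreamLanguage]]

_readers: Dict[str, Reader] = {}

def register_reader(extension: str, reader: Reader) -> None:
    """Associate a container-specific reader with a file extension."""
    _readers[extension.lower()] = reader

def read_languages(path: Path) -> List[StreamLanguage]:
    """Dispatch to the reader for this extension; empty list if none is known."""
    reader = _readers.get(path.suffix.lower())
    return reader(path) if reader else []
```

A Matroska DLL would then only have to supply the function registered for ".mkv"; other containers could be added later without touching the calling code.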
I don't know how important the Matroska format will be in the future, but what Christian told me during this discussion sounded encouraging. That's why I would like to start with Matroska as a "proof of concept", knowing this has to be extended to as many other container formats as possible.
I have little experience with video file containers, so I don't know to what extent language tags are available in AVI, OGM and WMV containers, let alone all the MPEG1/2 variants (VCD/SVCD/MVCD/XVCD). If anyone has specific knowledge about this then please post it here - I will try to ask experts about these formats in case my idea is accepted in general.
Now let's assume some Matroska language tags parser were already embedded into eMule. What else would we need?
If the information derived from this parser were to be used as a parameter in a lugdunum server search, then this information would have to be sent to the eserver when publishing a share.
Thus the protocol between eMule and eserver would have to be extended, and the feature would only be available for combinations of a future version of eMule connected to a future version of eserver. As both products appear to have a relatively short product cycle this doesn't look to be a real problem, it is just that we would have a migration phase until the full benefit of the feature can be derived.
But what would happen in case an older version of eserver received a "new style" query including a language description? We would not want this parameter to be ignored by the eserver and the client be flooded by invalid results from eservers with an older version, in case of a "global search". I guess we need Lugdunummaster to make a statement about this problem before we can go any more into details - or maybe eMule could restrict the feature to a server search for the time being? (Or to a required minimum version number of eserver?)
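Until the servers make a statement, one client-side fallback could be sketched like this: drop every result that comes back without language information, so that old servers cannot flood the result list. The dictionary layout and the "languages" key are assumptions for illustration only, not the real search result format.

```python
def filter_results(results: list, wanted: set) -> list:
    """Keep only results that carry a language list matching the query."""
    kept = []
    for result in results:
        languages = result.get("languages")  # None if the server sent no language info
        if languages is None:
            continue  # untagged result, e.g. from an older server: drop it
        if wanted & set(languages):
            kept.append(result)
    return kept
```

The price is that correctly named but untagged files disappear from a language-restricted search, so this would probably have to be an option.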
There's one more protocol change necessary - the format in which eMule is sending a search request to a lugdunum server would now have to allow for some language parameter. My impression is that restricting the language-related search to exactly one language might be too restrictive, so I'd suggest using some list of values. The exact format of transferring this list would probably have to be specified by Lugdunummaster.
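Just to have something to discuss, a list of languages could be transferred as a count byte followed by fixed-size three-letter codes. This layout is purely my assumption and not part of the existing eMule/eserver protocol; the real format would have to come from Lugdunummaster.

```python
def encode_languages(codes: list) -> bytes:
    """Encode three-letter codes as: 1 count byte, then 3 ASCII bytes per code."""
    if len(codes) > 255:
        raise ValueError("too many language codes")
    out = bytearray([len(codes)])
    for code in codes:
        raw = code.encode("ascii")
        if len(raw) != 3:
            raise ValueError(f"expected a three-letter code, got {code!r}")
        out += raw
    return bytes(out)

def decode_languages(data: bytes) -> list:
    """Inverse of encode_languages; rejects empty or truncated input."""
    if not data:
        raise ValueError("empty language list field")
    count = data[0]
    body = data[1:1 + 3 * count]
    if len(body) != 3 * count:
        raise ValueError("truncated language list")
    return [body[i:i + 3].decode("ascii") for i in range(0, len(body), 3)]
```

The same field could be used both in the publish message (languages found in the file) and in the search request (languages the user accepts).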
Because of these protocol changes there should be a thorough discussion and an exact specification of the feature; no one would want this feature to change across a number of versions and have to support different formats. This might be done within this thread, or directly between the eMule coders and the eserver coders if they prefer it this way.
Most of the above would apply to the Kademlia search as well. Kademlia's protocol would have to be extended for both the share attributes and the query parameters. But this could be done any time later, and independently from what I describe in this posting. Then again, if there were any problems with implementing the protocol changes between eMule and eserver, the eMule project could do it on their own, thus providing an additional feature in the Kademlia search. (In fact I don't expect any such problems, this is only for the record - it might just be nice to know that there would be a bypass around some external dependency, and the feature itself won't die in case something bad happens to the servers.)
One more missing part of the implementation: the eMule search GUI would then have to provide an additional field to enter the language(s) in the search form, and an additional column in the result window to display the languages received with each search result (I might be searching for German and English versions of a video but prefer to download only the German version if there is one). This part seems to be relatively trivial.
I would prefer to get separate result columns for video, audio and subtitle language tags. Christian told me that language tags of video streams would probably not make that much sense but they're technically possible in Matroska - and in case of a hard-subbed video (being the only stream within a video container) this would be all we can get, as there would not be any separate subtitle stream in this case (which is what AVI containers would be like, for example, if they happen to contain language tags as well).
On the other hand I'd probably not want separate input fields for video, audio and subtitle languages. Of course this would allow for a more specific query, and it might be a good idea to specify the eMule <-> eserver protocol in a way that allows for these queries. But I don't think many users would be able to make use of such a feature, and many would rather be confused instead. As the results would display the kind of tag where the required language was detected, and the user can see this before starting a download, I guess this would suffice and be easier to use.
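The "one input field, separate result columns" idea can be sketched as follows. The per-kind dictionary is an assumed data layout for one search result, nothing more.

```python
def matching_kinds(tags: dict, wanted: set) -> dict:
    """For each stream kind, list the requested languages that were found there."""
    hits = {}
    for kind in ("video", "audio", "subtitle"):
        found = [lang for lang in tags.get(kind, []) if lang in wanted]
        if found:
            hits[kind] = found
    return hits
```

A result would be shown if the returned dictionary is not empty, and its keys tell the GUI which columns to fill, so the user can see whether "ger" refers to the audio or only to the subtitles before starting the download.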
Ah, I almost forgot an important issue. What about the encoding to be used for languages? There is more than one way to represent a code.
IIRC the language files of eMule are using a code where German (from Germany) is "de_DE". This appears to be what is used for languages by operating systems.
Unfortunately Matroska is using a different standard to encode languages, named ISO 639-2. This would mean that eMule either needs to use this code to represent content languages (depending on the codes of other video containers this might be more or less reasonable, which is another reason why we should examine those container formats before going too much into details), or convert between the probably more popular two-letter language codes (which might well be ISO 639-1 from the same table but I'm not sure about this) and whatever code a specific video container might happen to be using.
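A conversion between the two codes is a simple table lookup. The sketch below contains only a handful of entries as an excerpt, uses the bibliographic three-letter forms ("ger", "fre"), and assumes that the part before the underscore in a name like "de_DE" is the two-letter language code. The function name is made up.

```python
from typing import Optional

# Small excerpt of the ISO 639-2 (bibliographic) to ISO 639-1 table.
ISO639_2_TO_1 = {
    "eng": "en", "ger": "de", "fre": "fr",
    "spa": "es", "jpn": "ja", "kor": "ko",
}
ISO639_1_TO_2 = {two: three for three, two in ISO639_2_TO_1.items()}

def locale_to_iso639_2(locale_name: str) -> Optional[str]:
    """Map a name like 'de_DE' to a three-letter code, or None if unknown."""
    language = locale_name.split("_", 1)[0].lower()
    return ISO639_1_TO_2.get(language)
```

With such a table the GUI could show whatever code the user prefers while the protocol sticks to one standard.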
The eMule coders would probably know which type of input widget for language codes they'd consider most user friendly - maybe some dropdown menu to append one language code of whatever standard to the list in the search form field? I'm aware this would require some language specific translation table for the dropdown menu entries but this would not be that much different from other language specific parts of the eMule GUI.
Of course the whole feature would only make sense if the creators of the video containers tagged their videos appropriately.
But I am convinced we would be able to spread the news of the availability of a language-based search feature in one of the most popular file sharing tools rather quickly amongst them, and tagging their containers appropriately might even become a cachet for video containers then.
I probably forgot a number of aspects of the project (if it were ever to become one, that is). Any comments, additions etc. are welcome, including those that explain to me why the whole idea might not work.
This post has been edited by Devil Doll: 09 March 2005 - 11:10 AM
Unresolved bugs in MorphXT (as of v8.13): Sorting search results by "Known" column, Sorting "Files" page by "Transferred Data" column
#6
Posted 28 April 2005 - 02:10 PM
I agree with your feature request, and I would like to see that feature implemented in eMule, but I think it is very difficult, almost impossible, to achieve.
1) you are putting the burden on the releasers to specify the language of the file they are releasing
2) you push Matroska (which I think is a marvelous container) as the default video format for releasers
3) AVI is the most widespread video format... what can we do with those files?
4) you are requesting a change in the lugdunum server and eMule/Kad protocols...
I think that by extending a system like the 'comments' with details about the file, we would have what is necessary... obviously, people can lie... but can you be sure that the releaser won't lie about his file with your system?
I support Devil Doll's idea of a language search, but not in the way he/she proposes. I would rather 'extend' the comments to set the language, quality, codec used and subtitles of a file.
#7
Posted 28 April 2005 - 02:40 PM
f5inet, on Apr 28 2005, 03:10 PM, said:
Anyone more into container formats, please post information about language tags in other containers to this thread!
f5inet, on Apr 28 2005, 03:10 PM, said:
We need video container file format experts to participate in this discussion - I'm not one of them, unfortunately.
#8
Posted 28 April 2005 - 02:50 PM
Devil Doll, on Apr 28 2005, 02:40 PM, said:
f5inet, on Apr 28 2005, 03:10 PM, said:
I didn't know that, I apologize...
Perhaps extracting the language data directly through DirectShow is all that is necessary, don't you think?
Also, the releasers would have to be pushed to specify the correct language in their AVI files...
#9
Posted 28 April 2005 - 07:43 PM
That's called evolution with backward compatibility
#10
Posted 19 June 2005 - 01:38 AM
Let's hope that any later implementation will cover language tags as well.
If the eMule coders want to implement anything themselves, they might consider an open source implementation like MediaInfo.
#11
Posted 19 July 2005 - 01:44 AM
Devil Doll, on Apr 28 2005, 06:40 AM, said:
Math is delicious!
MmMm! Mauna Loa Milk Chocolate Toffee Macadamias are little drops of Heaven ^_^
Si vis pacem, para bellum DIE SPAMMERS DIE!
#12
Posted 08 August 2005 - 02:43 PM
release notes v0.46c said:
----------------------
.: Added support for MediaInfo DLL versions 0.6.1 and 0.7.x [Thx to Zenitram]
#13
Posted 27 November 2005 - 05:23 PM
That said, if you do plan on trusting the stream names, it's easy to extract them using DirectShow. Just feed the file into the wrapper parser, and read the pin names. It's easy to tell audio apart from video using stream flags, but you could probably identify that just by name. You could also support subtitle stream names that way.
That said, with the protocol it's pretty trivial. You just add another meta-tag. I'm not sure, but I think the protocol already allows search filtering by meta-tags.
So adding a "limit search by pin name" is possible, but don't expect it to really find what you're after.
SlugFiller rule #1: Unsolicited PMs is the second most efficient method to piss me off.
SlugFiller rule #2: The first most efficient method is unsolicited eMails.
SlugFiller rule #3: If it started in a thread, it should end in the same thread.
SlugFiller rule #4: There is absolutely no reason to perform the same discussion twice in parallel, especially if one side is done via PM.
SlugFiller rule #5: Does it say "Group: Moderators" under my name? No? Then stop telling me about who you want to ban! I really don't care! Go bother a moderator.
SlugFiller rule #6: I can understand English, Hebrew, and a bit of Japanese(standard) and Chinese(mandarin), but if you speak to me in anything but English, do expect to be utterly ignored, at best.
#14
Posted 28 November 2005 - 02:26 PM
Devil Doll, on Apr 28 2005, 06:40 AM, said:
SlugFiller, on Nov 27 2005, 09:23 AM, said:
That doesn't sound like what he's suggesting at all. I thought he was referring to the wLanguage field in the AVISTREAMHEADER struct. Still, that's probably going to suffer from the same oblivion that the stream names currently suffer from. I doubt many encoders make use of that field for its intended purpose. AVI is such a fractured format anyway. It's amazing that it was supported this long.
#15
Posted 28 November 2005 - 07:17 PM
PacoBell said:
Devil Doll said:
In any case, as anyone who has ever used DX's GraphEdit would know, every wrapper codec provides stream names, and most wrappers allow them to be custom (I believe that includes AVI). However, most movies do not have intelligible stream names. These are mostly found in multi-lingual movies, and even then they're not always present.
Yes, it's possible to add search by stream name. Shouldn't be too difficult.
However, I wouldn't bank on it having the result you'd expect.
Quote
If you mean a parser using DirectShow, s/he wouldn't know, since DirectShow is there to avoid having to know the file formats. It abstracts anything.
If you're talking about the parser used by DirectShow, as in the codec, that would be Microsoft. Their file format, their DirectShow, their parser.
Fortunately, I can give you the answer, since DX's GraphEdit allows getting DirectShow to be a bit more verbose. While I couldn't find anything about a "wLanguage", the streams did prove to have custom names. Here are a few:
-Stream 01
-01) Audio Track
-01) ???????(1
-Raw Audio
This post has been edited by SlugFiller: 28 November 2005 - 07:18 PM
#16
Posted 28 November 2005 - 11:57 PM
SlugFiller, on Nov 28 2005, 11:17 AM, said:
PacoBell said:
Devil Doll said:
Quote
VirtualDubMod 1.5.1.1a Release Notes said:
You can still use user-defined languages (and overcome the standard) ...
AVIMaster said:
If you've got an eDonkey/eMule *.part and *.part.met file, avimaster file.part will extract the original file name from the accompanying file.part.met and copy all valid frames into the destination file. The first part (header) must have been downloaded already. Note: If only small parts of a large file are downloaded, it may take quite a while to scan the whole file.
One thing I learned from using VidTrace is that the wLanguage tag seems to be optional in the spec, so that's probably why it didn't show up in GraphEdit for your particular sample. HTH.
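For anyone who wants to check their own samples: in the AVISTREAMHEADER layout, the payload of a 'strh' chunk starts with fccType, fccHandler, dwFlags, wPriority and then wLanguage, so the field sits at byte offset 14. A minimal sketch, assuming you have already located the 'strh' chunk and pass in its payload without the 8-byte chunk header (the function name is made up):

```python
import struct

def parse_strh(payload: bytes) -> dict:
    """Decode the leading fields of an AVI 'strh' chunk payload."""
    if len(payload) < 16:
        raise ValueError("strh payload too short")
    # Little-endian: two FOURCCs, one DWORD, two WORDs.
    fcc_type, fcc_handler, flags, priority, language = struct.unpack_from(
        "<4s4sIHH", payload, 0
    )
    return {
        "type": fcc_type.decode("ascii", "replace"),  # e.g. "vids" or "auds"
        "flags": flags,
        "priority": priority,
        "language": language,  # raw wLanguage value; its meaning is not interpreted here
    }
```

This only reads the raw number; whether encoders actually fill it in is exactly the open question here.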
[EDIT]
Another source of language information can be found in the CSET chunk under the wLanguage & wDialect tags, although the VirtualDub guy had this to say about it:
VirtualDub Guy said:
Whoops! In my haste, I overlooked something important. Apparently, the CSET chunk applies only to the original RIFF standard and not to AVI, which although it was derived from RIFF, seems to have deprecated this part. Phooey, indeed.
[/EDIT]
This post has been edited by PacoBell: 29 November 2005 - 12:57 AM
#17
Posted 29 November 2005 - 12:22 AM
A crazy, but simple idea. What about every mule publishing its own GUI language setting? OK, I know, maybe you like films in a foreign language, and English films are being shared all over the world, but I think that statistically the most 'voted' language for a hash should be significant. Most people like their mule in their own language, and most users share content in their own language.
Crazy?
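In code the vote would be little more than a counter per file hash. A rough sketch, where the 50% threshold and the function name are arbitrary assumptions:

```python
from collections import Counter
from typing import Optional

def most_voted_language(votes: list, min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent GUI language among the sources of one hash,
    or None if no language reaches the required share of the votes."""
    if not votes:
        return None
    language, count = Counter(votes).most_common(1)[0]
    return language if count / len(votes) >= min_share else None
```

With few sources the result is close to random, so a minimum number of votes would probably be needed as well.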
This post has been edited by asturcon3: 29 November 2005 - 12:24 AM
#18
Posted 29 November 2005 - 12:39 AM
asturcon3, on Nov 28 2005, 04:22 PM, said:
...unless you're into anime where the language can be Japanese, Korean, or even English! That's why we have subtitles. Now, parsing the RIFF structure won't pick up on hardcoded subs, but if the proper subtitle chunk format is adhered to, it will be detected as such. Unfortunately, the vast majority of encoders (and I'm referring to the programs, not the people) don't strictly follow the set standards, so they end up breaking many rigidly compliant parsers out there. Even VirtualDub had to accommodate a few "quirks" in order to get it to work with most AVI files in the wild.
#19
Posted 29 November 2005 - 09:35 PM
Quote
I myself suspected it was reserved, judging from MS's specifications (it was always "0" in their examples). Let me know if you find an AVI with one set.
Quote
Samples. I looked at around 20 files. Most seem to call the streams "Stream 00" for video and "Stream 01" for audio. Must be encoder default.
Let me know how the language tag fares.
Quote
As you've mentioned, you could get pretty random results with movies, not to mention anime.
If you have country flags, you can check out your sources, though I've found that they rarely match the language.
#20
Posted 30 November 2005 - 03:51 AM
eMule can just add a bunch of footer bytes at the end of the file saying what the language is (and other data). It's quite possible this will not break the file in any way, and will simply be ignored by the application which reads the file. But eMule will be able to read this info. (I.e. add bytes to the end, and change the file's size, without worrying about the file format specs at all.)
While hashing the file, emule could exclude those bytes from the calculation, such that files with the footer and without will be properly recognized as the same file. (also chopped off from the file size info which helps to identify the file, for backward compatibility, but with a new info added to the protocol saying if there's an emule extension and how many bytes).
It's a hack. But there's a chance it will work for many file types, and will keep backward compatibility and support all sloppy releasers who will never add the language info in the proper DirectShow var (they would need a special application to put that info into their file; current apps don't do that, I think).
How I see it:
I download a video file that has no language info. I add that info from within emule. The info is sent with other metadata of the file to the server, so other users will see it when they search. If the releaser is not an idiot, he would do the same in emule when releasing his file.
There's no fear that this data will be lost or cut off when the file is read, and no one will write/re-write this file except emule and the original releaser. The only question is whether this will break the file playback. I even thought how emule can quickly know if the file has a special footer or not: place a special code at the very end of the file, preceded by the number of bytes of the footer, then read and interpret the footer the way you would a header. So it can be a variable-sized footer for future expansion.
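To show what I mean by the variable-sized footer, here is a small sketch working on bytes in memory. The marker value and the 4-byte length field are invented for this example and are not anything eMule defines.

```python
import struct
from typing import Optional, Tuple

MAGIC = b"EMULEFTR"  # hypothetical 8-byte marker, not a real eMule constant

def append_footer(payload: bytes, info: bytes) -> bytes:
    """Layout: payload + info + 4-byte little-endian length of info + MAGIC."""
    return payload + info + struct.pack("<I", len(info)) + MAGIC

def split_footer(data: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (payload, info); info is None when no footer is detected."""
    trailer = len(MAGIC) + 4
    if len(data) < trailer or not data.endswith(MAGIC):
        return data, None
    end = len(data) - trailer
    (size,) = struct.unpack_from("<I", data, end)
    if size > end:
        return data, None  # implausible length field, treat as no footer
    return data[:end - size], data[end - size:end]
```

The hash and the published file size would then be computed over the returned payload only. One weak spot: a file that happens to end with the marker bytes by chance would be misread, so the marker should be long and unusual.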
I might be totally off track. Maybe it's worth investigating.
This post has been edited by xylo9: 30 November 2005 - 04:15 AM









