fox88, on 15 June 2020 - 08:47 PM, said:
stoatwblr, on 14 June 2020 - 03:05 AM, said:
There are a lot of us who share "lots of files"
Those lots are just ignoring the natural limits.
Reasonable numbers are lower by about two orders of magnitude.
[/quote]
Define "reasonable numbers". It isn't 1998 anymore. Back then 100GB was larger than any existing hard drive (the first 100GB drives only showed up in 2000).
Quote
stoatwblr, on 14 June 2020 - 03:05 AM, said:
More prosaically on the work side I've been trying to find a distributed way of dealing with sharing literally hundreds of millions of astronomy files totalling several PB (anon FTP is going away for various reasons) and nothing scales
Continuing the analogy, it is like trying to load the whole cargo train into a Mini car.
What is wrong with FTP?
[/quote]
You mean apart from the massive security holes (no admin will let you run sftp to a public site), the fact that it's increasingly filtered to the point where even passive mode barely works, and the fact that it means a single source for everything? (The same hassles apply to http archives.)
NASA and ESA archives are deliberately limited to 100Mb/s connections to the outside world in order to cap their bandwidth use.
Various collaborating groups are proposing new P2P protocols to handle this, and I don't see much point in reinventing the wheel.
Quote
stoatwblr, on 15 June 2020 - 06:43 PM, said:
eMule doesn't _really_ need to load every single file name and hash into memory at startup before doing anything else, then announce them all.
That can be a lower-priority background process, and it can space out the announcements.
What does that spaced-out eMule do when it receives several requests for 25+ GB files that still need hashing?
Play nice music and say, "Please wait an hour, the files have already been queued for hashing"!?
A hint: there is no such message in the data protocol.
[/quote]
No hint needed. known2/known.met load up pretty quickly if nothing's shared at startup; the problems start when the directory tree is walked.
The _real_ problem here is the IO overhead of lstat()/open()/close() on each file (plus seek time on mechanical drives), not the actual hashing time. You can lose 8-20ms _per file_ (even more over NFS or SMB), and that cost doesn't scale with size - it's the same for small files as for large ones. Once a file is open, reading runs as fast as the disk can feed it.
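To put rough numbers on that, here's a minimal sketch (POSIX/Linux, file paths passed on the command line - nothing eMule-specific) that measures just the lstat()+open()+close() cost per file, with no reading or hashing at all. On a cold cache or over NFS/SMB this alone is where the milliseconds go:

[code]
// Times the fixed per-file metadata cost, independent of file size.
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    std::vector<std::string> files(argv + 1, argv + argc);  // paths from the command line
    using clock = std::chrono::steady_clock;

    auto start = clock::now();
    for (const auto& path : files) {
        struct stat st;
        if (lstat(path.c_str(), &st) != 0) continue;   // per-file metadata lookup
        int fd = open(path.c_str(), O_RDONLY);          // per-file open cost
        if (fd >= 0) close(fd);
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(clock::now() - start);

    if (!files.empty())
        std::printf("%zu files, %.2f ms/file just for lstat+open+close\n",
                    files.size(), us.count() / 1000.0 / files.size());
    return 0;
}
[/code]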
The problem is that during this scanning sequence the entire computer ends up waiting on the IO subsystem's open() and lstat() calls, and _this_ is what makes the program - and the overall system - unresponsive. So there are very real benefits in adopting some form of slow-start/slow-scan philosophy.
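A minimal sketch of what I mean by slow-scan, assuming a hypothetical HashAndAnnounce() callback (not eMule's actual internals): the directory walk runs on a detached background thread and pauses between files, so it never ties up the IO channel for long:

[code]
#include <chrono>
#include <filesystem>
#include <functional>
#include <thread>

// Walk the share directory in the background, handling one file at a time
// and yielding the IO subsystem between files.
void SlowScan(const std::filesystem::path& root,
              std::function<void(const std::filesystem::path&)> HashAndAnnounce,
              std::chrono::milliseconds pause = std::chrono::milliseconds(50)) {
    std::thread([=] {
        namespace fs = std::filesystem;
        auto opts = fs::directory_options::skip_permission_denied;
        for (const auto& entry : fs::recursive_directory_iterator(root, opts)) {
            if (!entry.is_regular_file()) continue;
            HashAndAnnounce(entry.path());       // hash (or just stat-check) one file...
            std::this_thread::sleep_for(pause);  // ...then back off before touching the next
        }
    }).detach();
}
[/code]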
(I spent a lot of time benchmarking IO latencies due to issues with networked /home on clustered filesystems serving several hundred clients. It was eye-opening: the delays are not where you might think they are, and there are several "2^n" scaling problems with latency as directories grow in size - a directory with 32,500 files might be 5 times faster to scan than one with 33,000 files, depending on the filesystem. NTFS is particularly susceptible to this, but Linux filesystems are similar, and FAT32 runs a very real risk of data loss at directory sizes beyond 4096 files, as well as incurring a 20x slowdown past 512 entries per directory.)
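For what it's worth, the kind of benchmark I mean is nothing fancy - something like this sketch (the directory names are made up) shows how plain enumeration time grows with entry count on a given filesystem:

[code]
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <system_error>

int main() {
    // Hypothetical test directories pre-populated with N empty files each.
    const char* dirs[] = {"/tmp/dir_32500", "/tmp/dir_33000"};
    for (const char* d : dirs) {
        auto start = std::chrono::steady_clock::now();
        std::size_t entries = 0;
        std::error_code ec;
        for (const auto& e : std::filesystem::directory_iterator(d, ec)) {
            (void)e;
            ++entries;                            // count entries; no stat, no open
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %zu entries in %lld ms\n", d, entries,
                    static_cast<long long>(ms));
    }
    return 0;
}
[/code]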
Given that you already have the hashes, you can use those to decide which files to access first - and, just like the startup scan, you can then lstat() (etc.) them to make sure they haven't changed before announcing them (slow scan, or listening for matching KAD requests) or accepting a TCP connection. In the latter case, if the file has changed the answer is "no": rehashing is triggered and the file is announced like any new hash.
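As a sketch of that check, assuming a hypothetical KnownEntry record loaded from known2/known.met (field names are mine, not eMule's): a single lstat() comparing size and mtime is enough to decide between announcing the stored hash and queuing a rehash:

[code]
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

struct KnownEntry {              // hypothetical in-memory record from known2/known.met
    std::string path;
    off_t       size;
    time_t      mtime;
    std::string ed2kHash;
};

enum class Verdict { Announce, Rehash, Missing };

Verdict VerifyBeforeAnnounce(const KnownEntry& e) {
    struct stat st;
    if (lstat(e.path.c_str(), &st) != 0)
        return Verdict::Missing;                  // file gone: drop the announcement
    if (st.st_size != e.size || st.st_mtime != e.mtime)
        return Verdict::Rehash;                   // changed: refuse requests, rehash, re-announce
    return Verdict::Announce;                     // unchanged: safe to announce the stored hash
}
[/code]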
The point is that KAD and ed2k announcements and requests are periodic, so refusing or ignoring a request isn't a problem. You just deal with it in your own time and announce when ready.
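Something along these lines (hypothetical types, not eMule's actual request handling): a request for a hash that hasn't been verified yet is simply refused for now, but it bumps that file to the front of the verification queue so popular files get ready and announced first:

[code]
#include <deque>
#include <string>
#include <unordered_set>

struct ShareState {
    std::unordered_set<std::string> verified;   // hashes already checked and announced
    std::deque<std::string>         pending;    // hashes still waiting for the slow scan
};

// Returns true if the request can be served now, false if the peer should retry later.
bool HandleRequest(ShareState& s, const std::string& hash) {
    if (s.verified.count(hash))
        return true;                            // ready: serve it
    // Not ready yet: move it to the front of the queue and answer "no" for now.
    for (auto it = s.pending.begin(); it != s.pending.end(); ++it) {
        if (*it == hash) {
            s.pending.erase(it);
            s.pending.push_front(hash);
            break;
        }
    }
    return false;                               // the peer retries on a later announcement cycle
}
[/code]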
It's far better than crippling the host with naive scanning algorithms which simply _do not scale_ - your "natural limits" are based on assumptions that are inherently flawed, because there is no need to load up everything at once. This isn't DOS; multithreading is normal these days, and processes which monopolise a system (either by computation or by tying up the IO channels) are frowned upon.