LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Search in Linux (https://www.linuxquestions.org/questions/linux-newbie-8/search-in-linux-4175563118/)

reemcs 01-05-2016 03:51 AM

Search in Linux
 
Hello ..

I have a website, and in my website code there is a link (e.g. ....../myfolder/file001.pdf) that is used to access a file on the Linux system.
My question is: how does Linux search for a file in a folder, especially if there are millions of files inside the folder?
I want to know the searching method. Is there a special algorithm, or does it go through all the files (file by file) and check each file name?
Or does it use the find method in Linux?


Thanks in advance.

jpollard 01-05-2016 04:31 AM

It depends on what you are referring to.

Searching for a specific file, searching for a file with a partial name?

If "how does Linux locate the file to read" is the question, then the answer depends on the filesystem used. Most current filesystems (ext4, for instance) store the directory as a tree, and the file name is hashed into a short key to allow quick matches - if the short key matches, the full name is compared to verify.

It is still a bit slow if the directory has millions of files, and there are other problems caused by that (slower backups, harder for people to scan the file names, longer searches when you don't know the exact name, too many names when performing file maintenance...).
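A quick way to see the effect (a minimal sketch; the scratch directory and file names below are invented for the demo):

```shell
# Create a scratch directory with a couple of thousand files, as a
# small stand-in for the "millions of files" case.
dir=$(mktemp -d)
for i in $(seq 1 2000); do
    : > "$dir/file$(printf '%06d' "$i").pdf"
done

# Opening a file by its exact path does not scan the directory entry
# by entry from userspace: the filesystem resolves the name through
# its directory structure (a hashed tree on ext4).
stat -c '%n %s' "$dir/file001000.pdf"

rm -rf "$dir"
```

The stat call prints the name and size (0 bytes here) of the one file it was asked for; nothing ever lists the other 1999 entries.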

reemcs 01-05-2016 05:10 AM

Quote:

Originally Posted by jpollard (Post 5473257)
It depends on what you are referring to.

Searching for a specific file, searching for a file with a partial name?

If "how does Linux locate the file to read" is the question, then the answer depends on the filesystem used. Most current filesystems (ext4, for instance) store the directory as a tree, and the file name is hashed into a short key to allow quick matches - if the short key matches, the full name is compared to verify.

It is still a bit slow if the directory has millions of files, and there are other problems caused by that (slower backups, harder for people to scan the file names, longer searches when you don't know the exact name, too many names when performing file maintenance...).



-------------------------------------
Thank you very much.

Yes, searching for a specific file in the system.

Does each Linux system support one filesystem, or more than one?

Do you have a resource (website or document) that explains this in detail? Especially the delay part (why it is slow).

ondoho 01-05-2016 06:07 AM

Quote:

Originally Posted by reemcs (Post 5473251)
in my website code there is a link (e.g. ....../myfolder/file001.pdf)

can you show us the html code?
what does the "......" stand for?

reemcs 01-05-2016 06:31 AM

Quote:

Originally Posted by ondoho (Post 5473295)
can you show us the html code?
what does the "......" stand for?

Sorry, I haven't written the code yet. I just want to know, given the exact path of a file, how Linux works to locate/find that file.
I'll start on the system next week.

jpollard 01-05-2016 06:37 AM

Quote:

Originally Posted by reemcs (Post 5473272)
-------------------------------------
Thank you very much.

Yes, searching for a specific file in the system.

Does each Linux system support one filesystem, or more than one?

Many. The ext family contains four versions (one has been dropped); ext2, ext3, and ext4 are closely related. Ext3 is an extension of ext2 that adds journaling. Ext4 is an extension of ext3 and adds support for large block allocations (extents) to reduce the amount of metadata used. An ext3 filesystem can be mounted as ext2, and ext4 can be mounted as ext2/3, BUT once ext4 has files with large block allocations (which happens with large files), those files show up as corrupted when that is done.

xfs is another filesystem (from SGI) that is also designed to handle large files and large filesystems. It is also used as the base for a cluster filesystem from SGI (CXFS) that has proprietary parts.

There is JFS from IBM, and reiserfs (not used as much now) with alternate data segments like HFS from Apple...

btrfs is from Oracle - it has some really nice features, such as built-in support for RAID 1/5/6, but also some really bad bugs (it is still in testing, RAID 5/6 especially, though RAID 1 also has some issues).

ISO9660 filesystems for DVD/CD usage...

NTFS is supported (but you have to be careful if the system is also booted to Windows - Windows has to be fully shut down before Linux can use the filesystem, or it shows up as corrupted). Another problem is that NTFS doesn't have quite the same definition of user identification... and it has some security weaknesses.

FAT16/32 is available for compatibility use but has no user identification at all, and file handling is a bit peculiar from the point of view of Linux (text files, for instance, should have <cr><lf> line endings for compatibility, but Linux files don't).
Quote:


Do you have a resource (website or document) that explains this in detail? Especially the delay part (why it is slow).
The delay is common to all filesystems that have directories with a huge number of files - some are worse than others. Some use a linear search to find files in a directory, and deletes can be worse (deleting the first file in a directory can require copying the entire directory up one entry... so deleting 100 files can cause 100 such copies).

To know which limitations apply to which filesystem, you have to search for the design documents... For the Linux-native ones that isn't too hard, but for those from outside it is harder to find.
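If you want to check what a particular server actually uses, a couple of quick commands (a sketch; "/" and /dev/sda1 below are placeholders for your own path and device):

```shell
# Print the filesystem type backing a path (here "/"); no root needed.
stat -f -c '%T' /

# On ext2/3/4 you can then list the enabled filesystem features.
# dir_index is the hashed-tree directory lookup discussed above.
# Needs root, and /dev/sda1 is a placeholder for your real device:
#   tune2fs -l /dev/sda1 | grep dir_index
```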

Soadyheid 01-05-2016 07:11 AM

Just a thought... Have you tried
Code:

# updatedb
which will index all your files and allow them to be found quickly by searching the index file it generates, rather than trawling through the whole filesystem on each search. No idea if it'll work with millions of files; an initial pass will probably take quite a while to run, but subsequent passes only update the index file with changes.
According to the man pages, it's generally run as a cron job overnight to pick up any changes: new files, deletions...

Finding files is then by using the command
Code:

$ locate <filename>
Wildcards or partial names can be used, but you can obviously generate a lot of output if you're not precise enough. Again, you could redirect the output to a file and work with that if needed.

Anyway... My :twocents:

Play Bonny!

:hattip:

reemcs 01-05-2016 03:03 PM

Quote:

Originally Posted by jpollard (Post 5473309)
Many. The ext family contains four versions (one has been dropped); ext2, ext3, and ext4 are closely related. Ext3 is an extension of ext2 that adds journaling. Ext4 is an extension of ext3 and adds support for large block allocations (extents) to reduce the amount of metadata used. An ext3 filesystem can be mounted as ext2, and ext4 can be mounted as ext2/3, BUT once ext4 has files with large block allocations (which happens with large files), those files show up as corrupted when that is done.

xfs is another filesystem (from SGI) that is also designed to handle large files and large filesystems. It is also used as the base for a cluster filesystem from SGI (CXFS) that has proprietary parts.

There is JFS from IBM, and reiserfs (not used as much now) with alternate data segments like HFS from Apple...

btrfs is from Oracle - it has some really nice features, such as built-in support for RAID 1/5/6, but also some really bad bugs (it is still in testing, RAID 5/6 especially, though RAID 1 also has some issues).

ISO9660 filesystems for DVD/CD usage...

NTFS is supported (but you have to be careful if the system is also booted to Windows - Windows has to be fully shut down before Linux can use the filesystem, or it shows up as corrupted). Another problem is that NTFS doesn't have quite the same definition of user identification... and it has some security weaknesses.

FAT16/32 is available for compatibility use but has no user identification at all, and file handling is a bit peculiar from the point of view of Linux (text files, for instance, should have <cr><lf> line endings for compatibility, but Linux files don't).


The delay is common to all filesystems that have directories with a huge number of files - some are worse than others. Some use a linear search to find files in a directory, and deletes can be worse (deleting the first file in a directory can require copying the entire directory up one entry... so deleting 100 files can cause 100 such copies).

To know which limitations apply to which filesystem, you have to search for the design documents... For the Linux-native ones that isn't too hard, but for those from outside it is harder to find.




----------------------------


Thank you very much for the answer and the prompt reply.

Question:
Directory indexing: dir_index (dir_index uses hashed B-trees to speed up name lookups in large directories).

Is this feature just for directories? What about the files - is it also used to index the files within a single directory?

reemcs 01-05-2016 03:04 PM

Quote:

Originally Posted by Soadyheid (Post 5473320)
Just a thought... Have you tried
Code:

# updatedb
which will index all your files and allow them to be found quickly by searching the index file it generates, rather than trawling through the whole filesystem on each search. No idea if it'll work with millions of files; an initial pass will probably take quite a while to run, but subsequent passes only update the index file with changes.
According to the man pages, it's generally run as a cron job overnight to pick up any changes: new files, deletions...

Finding files is then by using the command
Code:

$ locate <filename>
Wildcards or partial names can be used, but you can obviously generate a lot of output if you're not precise enough. Again, you could redirect the output to a file and work with that if needed.

Anyway... My :twocents:

Play Bonny!

:hattip:


---------------

Thank you so much, I will try it.

jpollard 01-05-2016 04:18 PM

Quote:

Originally Posted by reemcs (Post 5473541)
----------------------------


Thank you very much for the answer and the prompt reply.

Question:
Directory indexing: dir_index (dir_index uses hashed B-trees to speed up name lookups in large directories).

Is this feature just for directories? What about the files - is it also used to index the files within a single directory?

All directories. The contents of the files have to be handled by applications. The kernel can't do anything with those.
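For example, a search inside files is done entirely by userspace tools such as grep, which simply read every file; the kernel's directory index plays no part in it (the scratch files below are invented for the demo):

```shell
# The kernel indexes names, not contents; a content search is a
# plain read of each file by the application (grep here).
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a.txt"
printf 'goodbye\n' > "$dir/b.txt"

# -r: recurse into the directory, -l: print only matching file names.
grep -rl 'hello' "$dir"

rm -rf "$dir"
```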

reemcs 01-05-2016 09:28 PM

Quote:

Originally Posted by jpollard (Post 5473568)
All directories. The contents of the files have to be handled by applications. The kernel can't do anything with those.

------

I don't mean the contents of the files. I mean the contents of the directory (many files in a single directory). How does the kernel index these files? How does the kernel search for a file in a directory containing many files? What method does the kernel use for the search - hashing, or a linked list?

Or is it the same as for directories (the dir_index hashed B-tree), or is there a different way?

Thank you so much again.

jpollard 01-06-2016 04:09 AM

Quote:

Originally Posted by reemcs (Post 5473675)
------

I don't mean the contents of the files. I mean the contents of the directory (many files in a single directory). How does the kernel index these files? How does the kernel search for a file in a directory containing many files? What method does the kernel use for the search - hashing, or a linked list?

Or is it the same as for directories (the dir_index hashed B-tree), or is there a different way?

Thank you so much again.

You would have to dig into the code for the specific filesystem to find out. I believe ext4 uses a hash for making quick checks and stores the directory in a tree. What makes things more complicated is that the contents of the tree are stored on disk in buckets, but in memory they are cached to make searches shorter. There are also optimizations to speed up disk access.

You can find out some from https://ext4.wiki.kernel.org/index.php/Main_Page but for exact details you have to look at the source code.
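One visible side effect worth noting: with dir_index, the raw directory order follows the hash rather than creation or alphabetical order. You can peek at the raw order with ls -f (what you actually see depends on the filesystem your scratch directory lands on, so treat this as a sketch):

```shell
dir=$(mktemp -d)
for name in alpha beta gamma delta; do
    : > "$dir/$name"
done

# ls normally sorts its output; -f disables sorting (and implies -a),
# showing entries in the order the filesystem itself returns them.
ls -f "$dir"

rm -rf "$dir"
```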

reemcs 01-06-2016 04:11 AM

Quote:

Originally Posted by jpollard (Post 5473816)
You would have to dig into the code for the specific filesystem to find out. I believe ext4 uses a hash for making quick checks and stores the directory in a tree. What makes things more complicated is that the contents of the tree are stored on disk in buckets, but in memory they are cached to make searches shorter. There are also optimizations to speed up disk access.

You can find out some from https://ext4.wiki.kernel.org/index.php/Main_Page but for exact details you have to look at the source code.

________

Thank you sooo much dear.

ondoho 01-06-2016 11:47 AM

somehow i can't shake the feeling that op is actually talking about a server, and something like HTML_DOCUMENT_ROOT.

and not really about filesystems at all.

jpollard 01-06-2016 08:24 PM

Quote:

Originally Posted by ondoho (Post 5473982)
somehow i can't shake the feeling that op is actually talking about a server, and something like HTML_DOCUMENT_ROOT.

and not really about filesystems at all.

That is where the question started - asking how the system opens one file out of a possible million files...


All times are GMT -5.