LinuxQuestions.org
01-05-2016, 04:51 AM   #1
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Rep: Reputation: Disabled
Search in Linux


Hello,

I have a website, and in its code there is a link (e.g. ....../myfolder/file001.pdf) that is used to access a file on the Linux system.
My question is: how does Linux search for a file in a folder, especially if there are millions of files inside that folder?
I want to know the search method. Is there a special algorithm, or does it go through all the files one by one and check each file name?
Or does it use the find command in Linux?


Thanks in advance.
 
01-05-2016, 05:31 AM   #2
jpollard
Senior Member

Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
It depends on what you are referring to.

Are you searching for a specific file, or for a file with a partial name?

If the question is "how does Linux locate the file to read", then the answer depends on the filesystem used. Most current filesystems (ext4, for instance) store the directory as a tree, and the file name itself is hashed into a short key for quick matching: if the short key matches, the match is then verified against the full name.
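
To make the hash-then-verify idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how ext4 is implemented: the names `short_key` and `HashedDirectory` are made up for this example, `zlib.crc32` stands in for whatever hash the filesystem really uses, and a Python dict stands in for the on-disk tree.

```python
import zlib
from collections import defaultdict


def short_key(name: str) -> int:
    """Reduce a file name to a small integer key (illustrative hash, not ext4's)."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFF


class HashedDirectory:
    """Toy directory: entries are grouped by the short key of their name."""

    def __init__(self):
        # short key -> [(name, inode), ...]; the dict stands in for the tree
        self.buckets = defaultdict(list)

    def add(self, name: str, inode: int) -> None:
        self.buckets[short_key(name)].append((name, inode))

    def lookup(self, name: str):
        # Only entries sharing the short key are examined, and the full
        # name is compared to rule out hash collisions.
        for entry_name, inode in self.buckets.get(short_key(name), []):
            if entry_name == name:
                return inode
        return None


directory = HashedDirectory()
for i in range(100_000):
    directory.add(f"file{i:06d}.pdf", inode=1000 + i)

print(directory.lookup("file000001.pdf"))  # 1001
print(directory.lookup("missing.pdf"))     # None
```

With 100,000 names spread over at most 65,536 short keys, each lookup compares the full name against only the handful of entries that share a key instead of walking the whole directory.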

It is still a bit slow if the directory has "millions of files". Such a directory also causes other problems: it is slower to back up, it is harder for people to scan the file names, searches take longer when you don't know the exact name, and there are too many names to deal with when performing file maintenance.

Last edited by jpollard; 01-05-2016 at 05:36 AM.
 
01-05-2016, 06:10 AM   #3
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Thank you very much.

Yes, I mean searching for a specific file on the system.

Does each Linux system support one filesystem, or more than one?

Do you have a resource (website or document) that explains this in detail, especially the delay (slowness) you mentioned?
 
01-05-2016, 07:07 AM   #4
ondoho
LQ Addict

Registered: Dec 2013
Posts: 7,031
Blog Entries: 4

Rep: Reputation: 1687
Quote:
Originally Posted by reemcs
in my website code there is a link (e.g. ....../myfolder/file001.pdf)
can you show us the html code?
what does the "......" stand for?
 
01-05-2016, 07:31 AM   #5
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by ondoho
can you show us the html code?
what does the "......" stand for?
Sorry, I haven't written the code yet. I just want to know how Linux locates a file when I have its exact path.
I'll start on the system next week.
 
01-05-2016, 07:37 AM   #6
jpollard
Senior Member

Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
Quote:
Originally Posted by reemcs
Does each Linux system support one filesystem, or more than one?
Many. The ext family has four versions (the original ext has been dropped). Ext2, ext3 and ext4 are closely related: ext3 is an extension of ext2 that adds journaling, and ext4 is an extension of ext3 that adds, among other things, large block allocations (extents) to reduce the amount of metadata used. An ext3 filesystem can be mounted as ext2, and an ext4 filesystem can be mounted as ext2/3, BUT once ext4 has files with large block allocations (which happens with large files) that no longer works: to the older ext2/ext3 code those files look like a corrupted filesystem.

xfs is another filesystem (from SGI) that is also designed to handle large files and large filesystems. It is also used as the base for a cluster filesystem from SGI (cxfs) that has proprietary parts.

There is jfs from IBM, and reiserfs (not used as much now), with alternate data segments like HFS from Apple.

btrfs is from Oracle. It has some really nice features such as built-in support for raid 1/5/6, but also some really bad errors (it is still in testing, raid 5/6 especially, but raid 1 also has some issues).

ISO9660 filesystems are there for DVD/CD usage.

NTFS is supported, but you have to be careful if the system is also booted into Windows: Windows has to be fully shut down before Linux can use the filesystem, or it shows up as corrupted. Another problem is that NTFS doesn't have quite the same definition of user identification, and it has some security weaknesses.

FAT16/32 is available for compatibility use but has no user identification at all, and file handling is a bit peculiar from the point of view of Linux (text files, for instance, should have <cr><lf> line endings for compatibility, but Linux files don't).
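
If you want to see which filesystem types a particular machine currently knows about, the kernel lists the registered ones in /proc/filesystems. The Python sketch below parses that format; the function name `parse_filesystems` and the sample text are only assumptions for illustration, and the list on a real system depends on which drivers are built in or loaded.

```python
from pathlib import Path


def parse_filesystems(text: str):
    """Parse /proc/filesystems-style text into (name, needs_block_device) pairs."""
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        # Lines look like "nodev<TAB>sysfs" (virtual) or "<TAB>ext4" (disk-backed).
        needs_device = fields[0] != "nodev"
        result.append((fields[-1], needs_device))
    return result


# Made-up sample so the parser can be tried on any operating system.
sample = "nodev\tsysfs\nnodev\tproc\n\text4\n\tvfat\n"
print(parse_filesystems(sample))

proc_file = Path("/proc/filesystems")
if proc_file.exists():  # only present on Linux
    for name, needs_device in parse_filesystems(proc_file.read_text()):
        if needs_device:
            print(name)  # filesystem types that live on a block device
```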
Quote:

Do you have a resource (website or document) that explains this in detail, especially the delay (slowness) you mentioned?
The delay is common to all filesystems whose directories hold a huge number of files; some are worse than others. Some use a linear search to find files in a directory. Deletes can be worse still: deleting the first file in a directory can require copying the entire directory up one entry, so deleting 100 files can cause 100 copies to be done.
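
A small Python sketch of the linear case shows where the cost comes from. It is a toy model under stated assumptions, not the on-disk layout of any real filesystem: `LinearDirectory` is a made-up class that keeps its entries packed in one flat list and counts the work each operation does.

```python
class LinearDirectory:
    """Toy directory stored as a flat, packed list of (name, inode) entries."""

    def __init__(self, names):
        self.entries = [(name, i) for i, name in enumerate(names)]

    def lookup(self, name):
        """Return (inode, comparisons); inode is None if the name is absent."""
        comparisons = 0
        for entry_name, inode in self.entries:
            comparisons += 1
            if entry_name == name:
                return inode, comparisons
        return None, comparisons

    def delete(self, name):
        """Remove an entry and return how many later entries had to move up."""
        for index, (entry_name, _) in enumerate(self.entries):
            if entry_name == name:
                moved = len(self.entries) - index - 1
                del self.entries[index]  # later entries shift up by one slot
                return moved
        raise FileNotFoundError(name)


names = [f"file{i:04d}.pdf" for i in range(1000)]
directory = LinearDirectory(names)

print(directory.lookup("file0999.pdf"))  # (999, 1000): every name was compared
print(directory.delete("file0000.pdf"))  # 999: all remaining entries moved
```

Both costs grow in proportion to the number of entries, which is why this kind of directory gets painful with millions of files.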

To know which limitations apply to which filesystem, you have to search for design documents... For the Linux native ones that isn't too hard, but for those from outside, it is harder to find.
 
01-05-2016, 08:11 AM   #7
Soadyheid
Senior Member

Registered: Aug 2010
Location: Near Edinburgh, Scotland
Distribution: Cinnamon Mint 17.3 and 18.2 at present.
Posts: 1,301

Rep: Reputation: 302
Just a thought... Have you tried
Code:
# updatedb
which will index all your files and allow them to be found quickly by searching the index file it generates, rather than trawling through the whole file system on each search? I have no idea whether it will cope with "millions of files"; an initial pass will probably take quite a while to run, but subsequent passes only update the index file with changes.
According to the man pages, it's generally run as a cron job overnight to update with any changes; new files, deletions...

Finding files is then by using the command
Code:
$ locate <filename>
Wildcards or partial spelling can be used but you can obviously generate a lot of output if you're not precise enough. Again, you could redirect the output to a file and work with that if needed.
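
The idea behind the two commands (walk the tree once, then answer searches from the saved list) can be sketched in a few lines of Python. This is only an illustration of the principle, not how updatedb or locate are implemented; `build_index` and `search_index` are hypothetical names, and the folder and file names are made up.

```python
import os
import tempfile


def build_index(root: str):
    """Walk the tree once and record every file path (the slow, updatedb-like step)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)


def search_index(index, pattern: str):
    """Return indexed paths containing the pattern (the fast, locate-like step)."""
    return [path for path in index if pattern in path]


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "myfolder"))
    for name in ("file001.pdf", "file002.pdf", "notes.txt"):
        open(os.path.join(root, "myfolder", name), "w").close()

    index = build_index(root)                # done once, e.g. overnight
    matches = search_index(index, "file001") # repeated searches never touch the disk
    print([os.path.basename(path) for path in matches])  # ['file001.pdf']
```

Note that an index like this only helps when you do not know where a file is; opening a file by its exact path does not need it.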

Anyway... My

Play Bonny!


Last edited by Soadyheid; 01-05-2016 at 08:13 AM.
 
01-05-2016, 04:03 PM   #8
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Thank you very much for the answer and the prompt reply.

Question:
Directory indexing: dir_index (which uses hashed b-trees to speed up name lookups in large directories).

Is this feature only for directories? What about files? Or is it also used to index the files in a single directory?
 
01-05-2016, 04:04 PM   #9
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Thank you so much, I will try updatedb and locate.
 
01-05-2016, 05:18 PM   #10
jpollard
Senior Member

Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
Quote:
Originally Posted by reemcs
Directory indexing: dir_index (which uses hashed b-trees to speed up name lookups in large directories).

Is this feature only for directories? What about files? Or is it also used to index the files in a single directory?
It applies to all directories: dir_index indexes the entries (the file names) inside each directory. The contents of the files have to be handled by applications. The kernel can't do anything with those.
 
01-05-2016, 10:28 PM   #11
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by jpollard
All directories. The contents of the files have to be handled by applications. The kernel can't do anything with those.
------

I don't mean the contents of the files. I mean the contents of the directory (many files in a single directory). How does the kernel index these files? How does the kernel search for a file in a directory containing many files? What method does the kernel use for the search: hashing, a linked list, or something else?

Or is it the same as for directories (the dir_index hashed b-tree), or is there a different way?

Thank you so much again.
 
01-06-2016, 05:09 AM   #12
jpollard
Senior Member

Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
You would have to dig into the code for the specific filesystem to find out. I believe ext4 uses a hash for quick checks and keeps the directory in a tree. What makes things more complicated is that the contents of the tree are on disk in buckets, but in memory they are held in a cache to make the search shorter. There are also optimizations that speed up disk access.
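
The effect of the in-memory cache can be pictured with a small Python sketch. This is a toy model under stated assumptions and not kernel code: `CachedDirectory` is a made-up class, a dict stands in for the on-disk tree, and a counter stands in for real disk I/O.

```python
class CachedDirectory:
    """Toy model: a slow 'on-disk' lookup with an in-memory cache in front."""

    def __init__(self, on_disk_entries):
        self.on_disk = dict(on_disk_entries)  # stands in for the on-disk tree
        self.cache = {}                       # name -> inode, kept in memory
        self.disk_reads = 0

    def _read_from_disk(self, name):
        self.disk_reads += 1                  # a real lookup would do disk I/O here
        return self.on_disk.get(name)

    def lookup(self, name):
        if name not in self.cache:
            self.cache[name] = self._read_from_disk(name)
        return self.cache[name]


directory = CachedDirectory({"file001.pdf": 42, "file002.pdf": 43})
for _ in range(1000):
    directory.lookup("file001.pdf")

print(directory.disk_reads)  # 1: only the first lookup reached the "disk"
```

For a web server that serves the same files again and again, most lookups would be answered the second way, from memory.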

You can find out some from https://ext4.wiki.kernel.org/index.php/Main_Page but for exact details you have to look at the source code.
 
01-06-2016, 05:11 AM   #13
reemcs
LQ Newbie
 
Registered: Jan 2016
Posts: 7

Original Poster
Rep: Reputation: Disabled
Thank you sooo much dear.
 
01-06-2016, 12:47 PM   #14
ondoho
LQ Addict

Registered: Dec 2013
Posts: 7,031
Blog Entries: 4

Rep: Reputation: 1687
somehow i can't shake the feeling that op is actually talking about a server, and something like HTML_DOCUMENT_ROOT.

and not really about filesystems at all.
 
01-06-2016, 09:24 PM   #15
jpollard
Senior Member

Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
That is where the question started - and asking how the system opens one file out of a possible million files...
 
  


All times are GMT -5. The time now is 04:57 AM.
