LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Software (https://www.linuxquestions.org/questions/linux-software-2/)
-   -   Text file data into Postgres database (https://www.linuxquestions.org/questions/linux-software-2/text-file-data-into-postgres-database-733755/)

Sum1 06-17-2009 09:12 PM

Text file data into Postgres database
 
Hello Group,

I've wanted to start this project for a long time but never knew where to start. I figure it starts with finding out what you don't know. Any guidance and suggestions are greatly appreciated.

Background: I have thousands of folders and files (approx. 650,000 files) filled with client
data on my workplace linux server. I cannot encourage, guide, cajole,
or threaten anyone into following any set procedures for naming
schemes or business data procedures. The bosses don't see it as a
priority and everyone has the attitude "as long as I know where my
stuff is, I don't care." But what happens is, when people want to
check out others' research or other written work from the past, it's
difficult to find and identify the work product. And then of course I
receive complaints about how crappy "MY" server and network is. I see
folks running "Search" from the Win XP Start menu and scanning
multiple hard drives. It's a waste of time and server resources.

Objective: Make a searchable database that will produce results in 1 or 2 seconds. Use php, postgresql, and apache to host this
LAN-only accessible app.

Procedural Posture:

1. Obtain file and directory names and locations. [Completed]
2. Use data mining techniques to pull the necessary data out of "1."
and place it in a postgresql database.
3. Build the php/apache interface to perform searches by SQL queries.

- - - - - - - - - -

I believe No. 1 is completed. I used the following command to gather all file and directory names. "/abc" is the main directory on the linux samba server where all company data is located:

root@acme:/# ls -Ralh /abc > /home/mike/file_output.txt

file_output.txt is 33 megs. and lists approximately 650,000 file names and their directory paths.

I want to pull all the file name info. and directory info. from
"file_output.txt".
I want to place each file name into the database in one row and have
the related file location info. stated in the next database row.
Ultimately I want it to show up in the web. app. as a clickable link
that will open the document.

How do I pull the data from "file_output.txt" and place it into the postgres database?

Due to years of ever-changing file-naming schemes and changing directory paths, I simply want to provide a search field where users can type in any alpha-numeric sequence and pull matching results.

jan61 06-18-2009 01:38 PM

Moin,

I think, you could get better loadable results using find instead of ls. Try find in combination with the -printf option to get a list of the files you want to store in the database. To load the file, look for the COPY PostgreSQL command.

Jan

EDIT: In my opinion it's not a good idea to store the file's name and it's location in different rows. It would be a better idea to store the locations (i. e. directories) in one table with a unique ID and to store the filename and a reference to it's location in another table each in one row.

Tinkster 06-18-2009 10:28 PM

Quote:

Originally Posted by Sum1 (Post 3577775)
Hello Group,

I've wanted to start this project for a long time but never knew where to start. I figure it starts with finding out what you don't know. Any guidance and suggestions are greatly appreciated.

Background: I have thousands of folders and files (approx. 650,000 files) filled with client
data on my workplace linux server. I cannot encourage, guide, cajole,
or threaten anyone into following any set procedures for naming
schemes or business data procedures. The bosses don't see it as a
priority and everyone has the attitude "as long as I know where my
stuff is, I don't care." But what happens is, when people want to
check out others' research or other written work from the past, it's
difficult to find and identify the work product. And then of course I
receive complaints about how crappy "MY" server and network is. I see
folks running "Search" from the Win XP Start menu and scanning
multiple hard drives. It's a waste of time and server resources.

Objective: Make a searchable database that will produce results in 1 or 2 seconds. Use php, postgresql, and apache to host this
LAN-only accessible app.

Procedural Posture:

1. Obtain file and directory names and locations. [Completed]
2. Use data mining techniques to pull the necessary data out of "1."
and place it in a postgresql database.
3. Build the php/apache interface to perform searches by SQL queries.

- - - - - - - - - -

I believe No. 1 is completed. I used the following command to gather all file and directory names. "/abc" is the main directory on the linux samba server where all company data is located:

root@acme:/# ls -Ralh /abc > /home/mike/file_output.txt

file_output.txt is 33 megs. and lists approximately 650,000 file names and their directory paths.

I want to pull all the file name info. and directory info. from
"file_output.txt".
I want to place each file name into the database in one row and have
the related file location info. stated in the next database row.
Ultimately I want it to show up in the web. app. as a clickable link
that will open the document.

How do I pull the data from "file_output.txt" and place it into the postgres database?

Due to years of ever-changing file-naming schemes and changing directory paths, I simply want to provide a search field where users can type in any alpha-numeric sequence and pull matching results.




Silly question - why don't you use locate?



Cheers,
Tink

jlinkels 06-19-2009 05:46 PM

I had the hope the intention of the OP is to add tags in order to improve the search capabilities. Searchin file names is fine if you know what to look for and then locate is fine as you say.

jlinkels

jan61 06-20-2009 11:28 AM

Moin,

Quote:

Originally Posted by Tinkster (Post 3579102)
Silly question - why don't you use locate?

perhaps, because it's not available on the WinXP clients, which use his samba server?

Jan

jlinkels 06-20-2009 12:19 PM

but you can execute the locate from the PHP script and show the results in a table anyway...

jlinkels

jan61 06-21-2009 04:56 AM

Moin,

Quote:

Originally Posted by jlinkels (Post 3580699)
but you can execute the locate from the PHP script and show the results in a table anyway

That's not a way to search for e. g. files owned by a specific user, modified n days ago, ...

Jan

jlinkels 06-21-2009 07:21 AM

I only showed that locate could be used from XP. But you are right that it would be better to include more search keys.

Having said that, it seems that the OP did not plan to store that information in his database. Looking at his first post he intended to store only file names and paths. Unfortunately the OP doesn't seem to be interested in the discussion anymore.

jlinkels

Sum1 06-23-2009 09:33 AM

No!
I am very interested in this thread and appreciate your time and assistance.
My sincere apologies for not answering your responses more quickly.
I posted and then the next day was called away on a family responsibility.
I was checking email but for some reason didn't receive response notifications --- (need to check my forum settings).

I am at work and will try to sit down with your input this evening.

Thank you for your help.

Sum1 06-23-2009 10:40 AM

Quote:

Originally Posted by jan61 (Post 3578640)
Moin,

I think, you could get better loadable results using find instead of ls. Try find in combination with the -printf option to get a list of the files you want to store in the database. To load the file, look for the COPY PostgreSQL command.

Jan

EDIT: In my opinion it's not a good idea to store the file's name and it's location in different rows. It would be a better idea to store the locations (i. e. directories) in one table with a unique ID and to store the filename and a reference to it's location in another table each in one row.

Heh, took a few moments at work to try - find.
Excellent - this is the format I was looking for: directory path and file name, line by line.

I did ~$find /main_directory > /home/mike/test_find_output.txt

So now it's time to learn something about Postgresql's COPY command and make some trouble with databases. :-)

Thanks, Jan. Sometimes a little help with a small step can help someone make greater/larger steps in the future.

- - - - -

Thanks to jlinkels too -- I believe I understand your approach and how it could be used. I may see if I can learn to use this too.

jan61 06-23-2009 01:47 PM

Moin,

Quote:

Originally Posted by Sum1 (Post 3583481)
I did ~$find /main_directory > /home/mike/test_find_output.txt

you are working on a Linux system, right? In this case find has much more possibilities to fill your database. I already mentioned the -printf option - this option can help you to print out additional information (such as file type - is it a directory or a file?, last modification time, size, owner ...) in a format easy to use with COPY.

For example:
Code:

find /main -printf "%h|%f|%y|%s|%T+|%u|%g\n"
will give you:
- the directory name
- the file name
- the file type
- the size in bytes
- the last modification time
- the owner
- the group

in a "|" delimited format, which is easy to load into a database. To search for regular files only, add the "-type f" option. Much more is possible, look at the manual page.

Jan

Sum1 06-23-2009 04:10 PM

Quote:

Originally Posted by jan61 (Post 3583654)
you are working on a Linux system, right?

in a "|" delimited format, which is easy to load into a database. To search for regular files only, add the "-type f" option. Much more is possible, look at the manual page.

Hi Jan,

Yes, all linux on the server and Win XP Pro clients connecting to it.
Slackware ver. 12.2.

Thank you very much for the guidance.
I am looking at man find and google foo examples.

Regards,

Mike


All times are GMT -5. The time now is 12:21 AM.