WRT using the least resources, AFAIK SlimerJS is free. If the resource you're concerned about is the time it would take someone to learn to use SlimerJS, then there might be another way.
If the toolkit that's been designed to isolate string values based upon a pattern is intended to work with a file on disk, and that's why you talked about "dump out the html source that it has rendered to a text file", then it sounds like you're headed into "The Kludge Zone". If so, and since you mentioned AJAX, if you don't have a cleaner way to do what you need readily available, you might want to consider using the Linux xdotool command. xdotool can manipulate windows: move them, resize them, send them text input, mouse events, etc.
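For example, a few one-liners ( the window title 'Mozilla Firefox' and the coordinates here are just assumptions about your setup ):
Code:
# Find the first window whose title matches, then move and resize it.
WID=$(xdotool search --name 'Mozilla Firefox' | head -n 1)
xdotool windowactivate "$WID"
xdotool windowmove "$WID" 0 0
xdotool windowsize "$WID" 1280 1024
# Click at absolute screen coordinates, then type into the focused field.
xdotool mousemove 400 300 click 1
xdotool type 'some text'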
It sounds as if it might be used like this in your situation:
1) Get a list of item IDs for which you wish to get new data, and put the IDs into a file.
2) Use the file as input to a shell script which behaves as described in the following steps ( a rough sketch of such a script follows the list ).
3) Start a browser telling it to bring up a particular web page. With a Firefox profile named 'simple' that could look like this:
Code:
firefox -p simple 'http://www.your_providers_domain.com/item_query.html' &
4) Use the sleep command to wait plenty of seconds, expecting that the browser will be finished loading by then.
5) Run the xdotool command with options to find the browser window which is open to the web page, then have it position and size the browser window exactly.
6) With the browser window given a fixed size and position on the screen, you can use a "pixel ruler" such as KRuler to determine the exact X and Y pixel coordinates of the web page's input fields.
7) Read an item ID from the file containing the list of item IDs.
8) Run the xdotool command with options to enter the item ID at the exact X and Y pixel coordinates of the item ID input field, and to press the on-screen button that submits the request to the web site.
9) Use the sleep command to wait plenty of seconds, expecting that the browser will have the data by then.
10) Run the xdotool command with options to save the web page content to a file with a name based on the item ID.
11) Loop back to step 7 until there are no more item IDs.
12) Run your toolkit to grab the new data from the saved files and update your database.
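Here's a rough sketch of such a script. The URL, window title, file names, coordinates, and wait times are all assumptions you'd replace with values from your own setup; the Ctrl+S save-dialog handling in particular is a guess about how your browser behaves:
Code:
#!/bin/sh
# All coordinates are placeholders; measure yours with KRuler.
ID_X=420;  ID_Y=310       # item ID input field
BTN_X=420; BTN_Y=360      # submit button
LOAD_WAIT=15              # seconds to allow for page/data loads

firefox -p simple 'http://www.your_providers_domain.com/item_query.html' &
sleep "$LOAD_WAIT"

# Find the browser window, then pin down its position and size.
WID=$(xdotool search --name 'item_query' | head -n 1)
xdotool windowactivate "$WID"
xdotool windowmove "$WID" 0 0
xdotool windowsize "$WID" 1280 1024

while read -r ITEM_ID; do
    # Enter the item ID and submit the request.
    xdotool mousemove "$ID_X" "$ID_Y" click 1
    xdotool type "$ITEM_ID"
    xdotool mousemove "$BTN_X" "$BTN_Y" click 1
    sleep "$LOAD_WAIT"
    # Save the page: Ctrl+S, type a file name into the save dialog,
    # press Return. Assumes the dialog's name field has focus.
    xdotool key ctrl+s
    sleep 2
    xdotool type "item_${ITEM_ID}.html"
    xdotool key Return
    sleep 2
done < item_ids.txt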
WRT step 4, to determine how many seconds to give the sleep command, manually run the browser a few times to get an idea of a reasonable number of seconds to sleep inside the shell script. WRT step 9, likewise, manually request data a few times to get an idea of a reasonable wait.
If that makes sense for what you're doing but you'd like to make it cleaner, and if you're good with Javascript, use a browser extension such as Scriptish, which can run your own Javascript as if it were part of the web page that was loaded. I use Scriptish because when I tried GreaseMonkey ( on which Scriptish is based ) GM didn't seem to work very well.
Have Scriptish run Javascript as if it were part of your provider's web page, access the real document inside the wrapper ( the concept is discussed in the docs ), and, for example, change the title of the web page once the requested data has been loaded; the new title can be almost the same as the old, but contain the item ID.
Instead of using a long fixed sleep while waiting for data, modify the shell script to loop with a much shorter sleep; each time through the loop, have xdotool search for a browser window title with the desired item ID in it. If it's found, it's known that the new data has been loaded into the web page ( a sketch of that polling loop follows below ).
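Something like this, assuming your userscript appends "ITEM <id>" to the title once the data arrives ( the title text, the one-second interval, and the 60-try timeout are all assumptions ):
Code:
# Poll for the title the userscript sets once the data has loaded.
TRIES=0
until xdotool search --name "ITEM ${ITEM_ID}" > /dev/null 2>&1; do
    TRIES=$((TRIES + 1))
    [ "$TRIES" -ge 60 ] && { echo "timed out on ${ITEM_ID}" >&2; break; }
    sleep 1
done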
A HUGE KLUDGE to be sure, but it's also an approach that can be thrown together very quickly and is an automated way to update the database with the new info. YES, it DOES depend on the appearance of your provider's web page. So, at least in this regard, follow good non-kludge programming practices and put the various X and Y pixel coordinates passed to the various uses of xdotool into shell script variables, so they can easily be changed near the beginning of the shell script if your provider's web page structure changes.
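You could even go one small step further and keep the coordinates in a tiny file the script sources, so a page-layout change means editing one file and nothing else ( the file name and variable names here are just placeholders ):
Code:
# coords.conf -- edit this when the provider's page layout changes
ID_X=420;  ID_Y=310      # item ID input field
BTN_X=420; BTN_Y=360     # submit button

# near the top of the shell script:
. ./coords.conf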
Only a simple manual check of the provider's web page structure is needed before subsequent automated updates are done.
I've actually built one or two things this way, using a "state machine" pattern-matching engine I wrote to find data in text files based on a sequence of patterns. I've given the result to people; it's worked for them, and they liked it.
I would normally prefer a much more connected/cleaner way of doing things, but if time is short and it doesn't have to be polished/fancy, just work...
I hope this is in the spirit of how you're trying to accomplish your goal, makes sense the way I've explained it, and helps you; or if not, gives you some ideas that might be useful to you!