Forums > Non-*NIX Forums > Programming
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.


Old 02-05-2009, 08:22 PM   #1
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
efficient occurrence counting or alternative lexical analysis

I have a database of textual scientific information to analyze. Essentially, I'll be working with a set of text labels (compound scientific nouns), each associated with one or more scientific articles. I need to establish relative conceptual similarity between all terms occurring in the labels. As a start, my plan is to make a list of all terms occurring in labels, then count those terms in every article in the database. Any two terms occurring together in one or more articles will be assigned a correlation value (algorithm to be determined).

I'm trying to decide on an efficient way to count the word occurrences. One idea was to create an AVL tree of words for each article (with an occurrence count for each word), then extract the count for each term in the master list. This would exclude simple words such as "the", etc. Another idea is to create an AVL tree of the terms to search for, then match the words in each article against it to establish the occurrence counts.

At worst, I'll have ~60k unique terms to search for (with absolutely no correlations); at best, maybe a few thousand (with many strong correlations). The number of articles should be less than 30k. The resulting output should be an NxN matrix of correlation data for use with an LSA algorithm. The objective is to be able to take any two labels and assign a numerical correlation value to the pair. Because they're scientific terms originating from numerous countries, however, no reference exists to correlate them with each other. This is actually a pre-processing step for analysis of spatial data. Right now I'm working on an interim solution to get the rest of the system working; the lexical analysis will then be incrementally improved.

I'm trying to write this in C or C++, but I'm open to different solutions. Because this is just pre-processing to develop a reference table, it doesn't have to be compatible with the rest of the analysis code (but it would be nice.) I have to parse the actual data from XML and I plan to store it in an SQL database for processing. Thanks.

Last edited by ta0kira; 02-05-2009 at 08:28 PM.
Old 02-05-2009, 09:31 PM   #2
Registered: Sep 2007
Location: Mariposa
Distribution: FreeBSD,Debian wheezy
Posts: 811

Rep: Reputation: 178
Two thoughts.

First, if it's important to reduce the computation time in chopping up the articles into words, investigate the use of lex and yacc. (That would be flex and bison in the Linux world.)
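A word-splitter in flex is only a few lines. This is a sketch under assumptions (the lowercase normalization rule and the file name are illustrative); flex generates the C scanner from it:

```lex
%{
/* Minimal flex word-splitter: prints each word, lowercased, one per
   line.  Generate and build with:  flex words.l && cc lex.yy.c -o words
   ("words.l" is just an example file name.) */
#include <ctype.h>
#include <stdio.h>
%}
%option noyywrap
%%
[A-Za-z]+   {
                for (char *p = yytext; *p; ++p)
                    putchar(tolower((unsigned char)*p));
                putchar('\n');
            }
.|\n        ;   /* discard everything else */
%%
int main(void) { return yylex(); }
```

The generated scanner uses a DFA, so tokenizing is linear in the input with no per-word allocation, which is where the computation-time savings would come from.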

Second, if you're tempted to use AVL trees, you can avoid a whole bunch of coding by using libavl (overview/source, documentation). I use that library in a new project every few years. It's an old friend. It's flexible enough that putting in an occurrence count would be a snap.
Old 02-05-2009, 09:45 PM   #3
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Original Poster
Rep: Reputation: Disabled
Unfortunately the computation time will probably have little to do with the parsing itself. I'm still not quite sure what format I'll be able to access the articles in. Right now I know the database has links to the articles and (I think) the abstracts in text, so I might just parse the abstracts. In any case, I have to parse the pseudo-XML data to create the database, so the parser will be essentially complete already. I think I'll use my own parsing library for that (not so much a parser as a tree assembler, modifier, and exporter; I should rename it.)

The intensive part will be analysis after the term cross-references are tabulated. I'll be happy with an O(n^3) network analysis algorithm, and with potentially 30k nodes, the optimization will come from localizing smaller networks. We really haven't decided what our exact objective is, but we do know that a lexical correlation table is the foundation for everything else.

I'll take a look at that AVL implementation. I actually wrote one a few months ago (although in Java), but because this is research and my job isn't to reinvent the wheel, I'm looking for every preexisting (open-source) resource I can find. Thanks.
Old 02-06-2009, 12:50 AM   #4
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Centos 7.7 (?), Centos 8.1
Posts: 18,059

Rep: Reputation: 2663
That definitely sounds like a job for Perl. I've seen similar questions asked on PerlMonks, and there are almost certainly modules on CPAN to do some or all of what you want.
Might be worth asking the monks.
Old 02-06-2009, 02:05 PM   #5
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Original Poster
Rep: Reputation: Disabled
Thanks. I'll take a look (and see about learning Perl while I'm at it).

