LinuxQuestions.org


LAPIII 02-10-2012 10:42 AM

What software can I use to extract highlighted text from a document?
 
I want to extract highlighted notes all at once. Preferably while the original document is open so that I can compare side-by-side.

amani 02-10-2012 12:18 PM

If you do not mention the format, then there are too many answers.

LAPIII 02-10-2012 06:27 PM

Plain text, i.e. .TXT, and anything similar, e.g. .ODT

SecretCode 02-11-2012 03:07 AM

Plain text and ODT are not similar.

What do you mean by "highlighting" in a plain text file?

LAPIII 02-11-2012 11:34 AM

I mean highlighted text like in the attached image.

SecretCode 02-11-2012 01:35 PM

Well that's a PDF, which is different from plain text and different from .odt.

I think you'll have to restate your question much more clearly.

Nominal Animal 02-11-2012 02:04 PM

In Linux, highlighted text is automatically put on the PRIMARY clipboard.

X11 has two main clipboards, PRIMARY and CLIPBOARD. Almost all applications copy highlighted text to the PRIMARY clipboard. Edit>Copy (Ctrl+C) and Edit>Cut (Ctrl+X) copy the selection to the CLIPBOARD clipboard. Correspondingly, a middle-button click on the mouse pastes data from the PRIMARY clipboard, while Edit>Paste (Ctrl+V) pastes from the CLIPBOARD. (There is usually also a third, the SECONDARY clipboard, but I haven't seen it used.)
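
For illustration, the two can be inspected side by side with xclip. This is only a minimal sketch: it assumes xclip is installed and an X11 session is running, and the function name show_selections is made up for the example.

```shell
#!/bin/bash
# Sketch: print the current contents of the two main X11 selections.
# Assumes xclip is installed and DISPLAY points at a running X server.
# show_selections is a hypothetical helper name, not part of xclip.

show_selections() {
    for sel in primary clipboard; do
        printf '%s: ' "$sel"
        # -o prints the selection; errors (e.g. nothing selected) are hidden
        xclip -selection "$sel" -o 2>/dev/null
        printf '\n'
    done
}

# Usage: highlight some text, press Ctrl+C on some other text, then run:
#   show_selections
```

Highlighting one piece of text and copying another with Ctrl+C before calling show_selections should make the two lines differ.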

It is very easy to monitor the clipboard contents. It is easiest using a C or C++ program (because you can use the X11 libraries directly, with no GUI toolkit dependencies), but even a simple shell script using the xclip utility will do. For example:
Code:

#!/bin/bash

if [ $# -gt 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [FILENAME]"
    echo ""
    echo "This script will auto-append primary clipboard"
    echo "contents to FILENAME."
    echo ""
    exit 0
fi

if [ $# -eq 1 ]; then
    exec >>"$1"
    shift 1
fi

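WORK="$(mktemp -d)" || exit $?
trap "rm -rf '$WORK'" EXIT    # remove the temporary directory on exit

while true; do
    # Keep the previous snapshot, then take a new one of the PRIMARY selection
    mv -f "$WORK/curr" "$WORK/prev" 2>/dev/null
    xclip -selection primary -o > "$WORK/curr"
    if ! cmp -s "$WORK/prev" "$WORK/curr" ; then
        # The selection changed: output it
        cat "$WORK/curr"
        echo ''
    else
        sleep .25
    fi
done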

Note that the echo '' line adds a newline after each new selection is pasted. You might wish to edit it.

To use the above script, you can e.g. tee its output to a file or supply a file name on the command line to keep updating the file (like tail -f). If you want to supply the file to a text editor of some sort, you'll likely need to send a signal or message to tell that editor to reload the file after each change; because the communications method depends totally on the text editor of your choice, I omitted that stuff.

Hope this helps,

LAPIII 02-11-2012 04:34 PM

Sorry about the PDF image above. I have attached an image of highlighted text in a regular text document. Some text editors can highlight different syntax elements and extract them all at once.

Nominal Animal 02-11-2012 05:56 PM

Unfortunately that kind of highlighting is usually not a selection (automatically copied to a clipboard), only a visual effect. Since it is just a visual effect, it is very difficult to capture.

What you need to do is extend your editor to copy the highlighted items to a clipboard. (Since it can add a visual effect to the items, it certainly can copy them to the clipboard, too. I do not know why most editors lack a "copy all highlighted text to clipboard" option; it should not be difficult to implement.)

On the other hand, you can extend the simple script I supplied above. Instead of copying all selected text, let the script pick out the parts you want. This variant saves all words that contain (but do not start with) "at":
Code:

#!/bin/bash

if [ $# -gt 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [FILENAME]"
    echo ""
    echo "This script will auto-append words from the primary"
    echo "clipboard that match the pattern to FILENAME."
    echo ""
    exit 0
fi

if [ $# -eq 1 ]; then
    exec >>"$1"
    shift 1
fi

WORK="$(mktemp -d)" || exit $?
trap "rm -rf '$WORK'" EXIT    # remove the temporary directory on exit

while true; do
    mv -f "$WORK/curr" "$WORK/prev" 2>/dev/null
    # One word per line, keeping only words that match the pattern
    xclip -selection primary -o | tr -s '\t\n\v\f\r ' '\n\n\n\n\n\n' | sed -ne '/\w\+at/ p' > "$WORK/curr"
    if ! cmp -s "$WORK/prev" "$WORK/curr" ; then
        # The list of matching words changed: output it
        cat "$WORK/curr"
    else
        sleep .25
    fi
done

Here, the tr command translates all whitespace to newlines, so each word ends up on its own line. The sed command only prints lines that contain at preceded by at least one letter, digit, or underscore ("word character").
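
To see what the filter keeps, it can be run on fixed sample text instead of the live selection. This is only a sketch: the function name words_with_inner_at is made up, and the pattern is written with POSIX bracket expressions, which should behave like the \w\+at form while not depending on GNU sed.

```shell
#!/bin/bash
# Sketch: the word filter, run on sample text rather than the selection.
# words_with_inner_at is a hypothetical name used only for this example.

words_with_inner_at() {
    # Put every whitespace-separated word on its own line, then keep
    # only words where "at" is preceded by at least one word character.
    tr -s '\t\n\v\f\r ' '\n\n\n\n\n\n' |
        sed -n '/[[:alnum:]_][[:alnum:]_]*at/p'
}

# Prints cat, sat and flat, one per line; "at" and "attic" are dropped
# because nothing precedes their "at".
printf 'the cat sat at\tthe attic flat\n' | words_with_inner_at
```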

If you want to save all unique words matching that pattern, use awk (GNU awk, since the script relies on a regular-expression RS and on \w):
Code:

#!/bin/bash

if [ $# -gt 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [FILENAME]"
    echo ""
    echo "This script collects unique matching words from the primary"
    echo "clipboard and writes them to FILENAME (or to the terminal)."
    echo ""
    exit 0
fi

if [ $# -gt 0 ] && [ "$1" != "-" ]; then
    file="$1"
else
    file=""
fi

while [ 1 ]; do
    xclip -selection primary -o
    echo
    sleep .25
done | awk -v file="$file" '
    BEGIN {
        RS="[\t\n\v\f\r ]"  # All words are records
        FS=RS              # No field splitting (same as RS)
    }

    /\w+at/ {
        if ($0 in words)
            next

        words[$0]          # Looks funny, but adds key $0 into words array

        # Save all unique words to file
        if (length(file) > 0) {
            for (w in words)
                printf("%s\n", w) > file
            close(file)
        } else {
            printf("\033[H\033[2J")  # Clear
            for (w in words)
                printf("%s\n", w)
        }
    }
'

Run this script first, then just select all the text you want to check for words matching the pattern. Wait at least half a second before selecting a new area, or before closing the editor or reader. It does not matter if you select the same words more than once; the awk script maintains a list of only the unique words it sees. It should work with all editors and viewers in Linux.

If you supply a filename to the script, it will update that file. If you don't supply a filename, it will clear the terminal then write the current word list each time the list changes.
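
The "remember each word only once" step can be tried on its own. The sketch below is a cut-down illustration, not the script above: it reads fixed sample lines instead of the selection, prints each new word once in first-seen order, and uses only portable awk features. The name unique_words is made up for the example.

```shell
#!/bin/bash
# Sketch: keep only the first occurrence of each input line.
# unique_words is a hypothetical name used only for this example.

unique_words() {
    awk '
        !($0 in seen) {   # true only the first time this line appears
            seen[$0]      # referencing the key creates it in the array
            print
        }
    '
}

# Prints flat, cat and bat, one per line.
printf 'flat\ncat\nflat\nbat\ncat\n' | unique_words
```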

To be honest, I don't really think this is what you are after. This does solve the problem statement you supplied, but I think your real need might be something different. I'd recommend you describe the problem you want solved instead of trying to work a specific solution -- if my intuition is correct and this does not solve your actual problem. (I'm making quite a few assumptions here, so if I'm wrong, please don't be offended: I'm just trying to help.)

David the H. 02-12-2012 05:17 AM

I think you need to define exactly what you want to extract from the selection, how you want the results displayed, and how you want to run it (e.g., by hotkey?). As it stands the request is still too vague to give anything but general advice on. Break it down with a detailed example.


The primary selection isn't really a "clipboard" as such, as there is no buffer where the text is ever saved. It's actually a data transfer function; when you middle-click, all text in the currently (or most recently) highlighted area is directly transferred from that open process to the target one. That's why you lose the pasting ability if you close a process after highlighting text in it.

The clipboard is a true save buffer, however, and will persist after the process is closed.

I've never found a clear definition of what the secondary is, but as a wild guess I think it may be the way the x-server keeps track of the last-highlighted block of text, if the actual visual highlighting is cleared. It's not particularly important for general use in any case.


xclip or xsel can be used to view and access the contents of either of them. I prefer xsel myself. A bit of scripting can be used to filter out the contents you want, as NA has demonstrated. I've written a script or two myself using them, like one that will search for links in the selection and open them in my browser, or do a Google search on the text if none are found. I have it bound to a keyboard shortcut so I can run it on demand.

At the very least it's quite easy to transfer data to and from primary and clipboard using them.

Code:

xsel | xsel -i -b        #transfer selection to clipboard
xsel -b | xsel -i        #transfer clipboard to selection
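
A link-opening helper like the one described above might look roughly like the following. This is a sketch under assumptions, not the actual script: it assumes xsel and xdg-open are installed, and the function names, the URL pattern and the search URL are chosen only for illustration.

```shell
#!/bin/bash
# Sketch of an "open links in the selection, else search" helper.
# Assumptions: xsel and xdg-open are installed; the function names and
# the search URL are illustrative, not taken from the original script.

extract_urls() {
    # Print every http(s) URL found on stdin, one per line.
    grep -oE 'https?://[^[:space:]<>"]+'
}

open_selection() {
    text=$(xsel)                       # current primary selection
    urls=$(printf '%s\n' "$text" | extract_urls)
    if [ -n "$urls" ]; then
        printf '%s\n' "$urls" | while IFS= read -r url; do
            xdg-open "$url"
        done
    else
        # Crude query encoding: only whitespace is replaced
        query=$(printf '%s' "$text" | tr -s '[:space:]' '+')
        xdg-open "https://www.google.com/search?q=$query"
    fi
}

# Usage: bind a keyboard shortcut to a script that calls open_selection.
```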


LAPIII 02-12-2012 07:09 AM

The best example of what I want is to hold Ctrl and highlight words, then drag & drop to a blank document.


All times are GMT -5.