LinuxQuestions.org
Visit Jeremy's Blog.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 05-23-2024, 09:59 AM   #1
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Rep: Reputation: 1
Selecting columns containing a value (Python, pandas)


Hi,

I am a beginner with using Python.

Task:
From a few thousand excel spreadsheets (.xlsx) in the current directory, I would like to save each sheet separately, but only columns somewhere containing certain strings (e.g. 'Team', 'Total').

I have the code so far to save the sheets, but I am lost in guides and documentation as to select the columns with matching data. The columns have no headers.

General aim:
...To impress the boss! I had wanted to learn python/R/Excel for analysing data, but if this goes well, perhaps they might pay for a few exams in Excel. I am working in admin at the moment, for which I am grateful, but I would like to do something more analytical. This task came up just as I have been starting to learn Python, so some of this is still ahead of me at the moment.

Code:
import pandas as pd
import os
import openpyxl
# print("Copying sheets from multiple files to one file")
cwd = os.path.abspath('') 
files = os.listdir(cwd)  

for file in files:                         # loop through Excel files
    if file.endswith('.xls') or file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
             
        sheets = excel_file.sheet_names
        for sheet in sheets:               # loop through sheets inside an Excel file
            print (file, sheet)
            df = excel_file.parse(sheet_name = sheet)
            df.dropna(inplace=True)
            # df[df.dropna().str.contains('Total|Team', case=False)]        # This didn't work ('DataFrame' object has no attribute 'str'); not sure how to fix or to use different approach)
            # df = pd.Series([df.dropna()]).str.contains("Team")         # This filtered matching rows, not matching columns!
            df.to_csv(f"{file}-{sheet}.csv", index=False)
Example input: (desired columns shown in red, within which search strings 'Team' and 'Total' shown in bold (case insensitive)

Spreadsheet 1:
Code:
Random descriptions and other text
More text


Action         Name      Dpt             1       2      3      4     5     Total
Break          J         Nursing         N/A     N/A    33     33    10    76
Review         N/A       All             N/A     N/A    10     12    5     27
COA            All       Study-team      200     10     300    20    0     530
Assmt          General   Admin-team      20      30     N/A    30    N/A   50
Spreadsheet 2:
Code:
Random text about the project
More random text        Text        More text


Action         Name     Sponsor     Responsible     Process    1       2      3      4      5      6     7     8     Total    Comments     Resources   
Set-up         J        PPD         Nursing         Added      N/A     N/A    33     33     10     76    40    40    232      Outstanding  None        None
Costing        J        PPD         Study-team      Added      40      N/A    N/A    10     12     5     27    32    131
Contact        All      Realto      Study-team      Pending    200     10     300    20     0      50    50    20    650
Monitoring     General  MOD         Admin-team      Pending    20      30     N/A    30     N/A    50    50    20    200      TBC
Monitoring2    MK       PPD         Admin-team      Pending    60      N/A    N/A    N/A    N/A    N/A   N/A   N/A   60
Resources      J        Realto                      Pending    60                                                    60
Closedown      MK       Realto                      Pending    N/A     N/A    N/A    N/A    N/A    N/A   N/A   300   300      TBC          Inhouse
Spreadsheet 3:
Code:
Line of text
Line of text
Line of text            Some text


Action        Specf     Team      Location    Cycle1     Cycle2      Cycle3    TotalCost    Startdate      
Break         V2        Admin     GBHD        40         60          40        140          2024-12-14
Review        V3        AS        Old         190        24          190       404          TBC 
COA           V3        TBC       Main        194        206         54        454 
Assmt         V2        None      General
Desired output:

Spreadsheet 1:
Code:


Dpt            Total
Nursing        76
All            27
Study-team     530
Admin-team     50
Spreadsheet 2:
Code:

More text     <--- This text not really desired, but I can remove these lines later


Responsible   Total  
Nursing       232      
Study-team    131
Study-team    650
Admin-team    200 
Admin-team    60
              60
              600
Spreadsheet 3:
Code:


Some text   <--- This text not really desired, but I can remove these lines later


Team      TotalCost       
Admin     140      
AS        404  
TBC       454 
None

Thank you

Last edited by lxs602; 05-25-2024 at 07:40 AM. Reason: Added more examples
 
Old 05-24-2024, 02:00 AM   #2
gerard4143
Member
 
Registered: Jan 2008
Location: P.E.I. Canada
Posts: 32

Rep: Reputation: 4
Have you heard of AWK or GAWK.

Code:
awk '{printf "%-17s %s\n", $3, $9}' datafile
Code:
Dpt               Total
Nursing           76
All               27
Study-team        530
Admin-team        50
Ooops! You are working with spreadsheets. My bad!

Last edited by gerard4143; 05-24-2024 at 02:11 AM.
 
Old 05-24-2024, 02:39 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702

Rep: Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535
probably this helps: https://pandas.pydata.org/docs/getti...bset_data.html
 
1 members found this post helpful.
Old 05-24-2024, 03:47 AM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,506
Blog Entries: 3

Rep: Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814
Of if you focus on openpyxl instead of pandas:

Code:
#!/usr/bin/python3                                                              
import os
import openpyxl   # https://openpyxl.readthedocs.io/en/stable/                  
import csv        # https://python.readthedocs.io/en/latest/library/csv.html    

cwd = os.path.abspath('')
files = os.listdir(cwd)

for file in files:                         # loop through Excel files           
    if file.endswith('.xls') or file.endswith('.xlsx'):
        excel_file = openpyxl.load_workbook(file)
        sheets = excel_file.sheetnames
        for sheet in sheets:
            print (f'Processing {sheet} from {file}')
            worksheet = excel_file[sheet]
            csvfile = csv.writer(open(file+"."+sheet+".csv",
                                      'w',
                                      newline=''),
                                 delimiter='\t',
                                 )

            # an exceptionally clumsy way:
            for row in worksheet:
                newrow = list()
                newrow.append(row[2])
                newrow.append(row[8])
                csvfile.writerow([cell.value for cell in newrow])
Or you could "just" convert to CSV and use a recent version of AWK as it recently gained support for reading CSV files.
 
Old 05-24-2024, 12:10 PM   #5
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by gerard4143 View Post
Have you heard of AWK or GAWK.

Code:
awk '{printf "%-17s %s\n", $3, $9}' datafile
Code:
Dpt               Total
Nursing           76
All               27
Study-team        530
Admin-team        50
Ooops! You are working with spreadsheets. My bad!

I could convert them all to csv, and search for text in columns using something other than python.

I can't use column numbers, as there are thousands of spreadsheets, and the search terms are all in different columns.

Last edited by lxs602; 05-24-2024 at 12:40 PM.
 
Old 05-24-2024, 12:42 PM   #6
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by Turbocapitalist View Post
Of if you focus on openpyxl instead of pandas:

Code:
#!/usr/bin/python3                                                              
import os
import openpyxl   # https://openpyxl.readthedocs.io/en/stable/                  
import csv        # https://python.readthedocs.io/en/latest/library/csv.html    

cwd = os.path.abspath('')
files = os.listdir(cwd)

for file in files:                         # loop through Excel files           
    if file.endswith('.xls') or file.endswith('.xlsx'):
        excel_file = openpyxl.load_workbook(file)
        sheets = excel_file.sheetnames
        for sheet in sheets:
            print (f'Processing {sheet} from {file}')
            worksheet = excel_file[sheet]
            csvfile = csv.writer(open(file+"."+sheet+".csv",
                                      'w',
                                      newline=''),
                                 delimiter='\t',
                                 )

            # an exceptionally clumsy way:
            for row in worksheet:
                newrow = list()
                newrow.append(row[2])
                newrow.append(row[8])
                csvfile.writerow([cell.value for cell in newrow])
Or you could "just" convert to CSV and use a recent version of AWK as it recently gained support for reading CSV files.
How would this search for terms in the text of the columns, and then select only those columns?
 
Old 05-24-2024, 12:52 PM   #7
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Most of the information I find seems to be either to:
- search for strings in rows, rather than search for strings in columns.
- select columns by column name, and not by text in the body of the column. My spreadsheets have no column names.

I added two more example spreadsheets to the first post.

It seems more complex than I thought.

Last edited by lxs602; 05-25-2024 at 07:03 AM.
 
Old 05-24-2024, 12:53 PM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,506
Blog Entries: 3

Rep: Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814Reputation: 3814
Quote:
Originally Posted by lxs602 View Post
How would this search for terms in the text of the columns, and then select only those columns?
If the first row (or some other row) of each table has the terms you are looking for, have the script figure out which column numbers to use before iterating through the rest of the row for that table.
 
1 members found this post helpful.
Old 05-24-2024, 08:42 PM   #9
michaelk
Moderator
 
Registered: Aug 2002
Posts: 26,128

Rep: Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076Reputation: 6076
Can you determine the type of spreadsheet from the name or from any particular keyword inside the spreadsheet? If not then maybe the spreadsheet's metadata might contain some identifying information that could determine the type. Otherwise I am not sure how you will determine which columns. While you could search for the desired keyword the column for desired number i.e total or totalcost is not the same which makes it a bit difficult.
 
Old 05-25-2024, 02:58 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702

Rep: Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535
what I see is still the same
You want two columns. This is easily made by: df(["one", "two"]). The only problem I don't know how do you want to pick the name of those columns.
And how do you want to identify those "random" lines.
 
Old 05-25-2024, 06:25 AM   #11
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by pan64 View Post
what I see is still the same
You want two columns. This is easily made by: df(["one", "two"]). The only problem I don't know how do you want to pick the name of those columns.
And how do you want to identify those "random" lines.
Somewhere in the text of the columns (but not in the column names) will be keywords, for example, 'team', or 'cost', or 'lab', etc.
  • A column with a matching keyword somewhere in the body of their text should be kept, including all its rows.
  • Columns have no names - the data is from Excel spreadsheets

The task is that these are messy Excel spreadsheets (all originally based on a template), which need tidying up. This is to sum cost per team for each project (each spreadsheet), and other basic analysis.

Last edited by lxs602; 05-25-2024 at 07:05 AM.
 
Old 05-25-2024, 07:07 AM   #12
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by michaelk View Post
Can you determine the type of spreadsheet from the name or from any particular keyword inside the spreadsheet? If not then maybe the spreadsheet's metadata might contain some identifying information that could determine the type. Otherwise I am not sure how you will determine which columns. While you could search for the desired keyword the column for desired number i.e total or totalcost is not the same which makes it a bit difficult.
I think Turbocapitalist is right - I need to think about searching rows to find columns with a keyword, and then pick columns based on that for each spreadsheet.
 
Old 05-25-2024, 07:14 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702

Rep: Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535
Your specification does not fit, there is no "cost" in your examples, you cannot use it to specify anything. Otherwise it means at least 3 steps:
1. remove unnecessary lines
2. identify necessary columns
3. construct the result

I would split this task into parts and solve them one by one. The question is not the last step, but the previous ones.
 
Old 05-25-2024, 07:38 AM   #14
lxs602
Member
 
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by pan64 View Post
Your specification does not fit, there is no "cost" in your examples, you cannot use it to specify anything. Otherwise it means at least 3 steps:
Apologies for the typo. Thanks for pointing that out. I have corrected the first post.

Last edited by lxs602; 05-25-2024 at 07:39 AM.
 
Old 05-25-2024, 09:41 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702

Rep: Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535Reputation: 7535
as an extremely dumb solution you can just pick all the columns you need (Dpt, TotalCost, Responsible, Team, Total), some of them will be just empty. If you can cut everything before the first usable line.
 
  


Reply

Tags
beginner, python


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
transpose selected groups of rows of columns into multiple columns (arrays) with other columns retained to single row-group TempleStone4510! Linux - General 15 08-22-2022 06:52 PM
LXer: Select Pandas Dataframe Rows And Columns Using iloc loc and ix LXer Syndicated Linux News 0 11-28-2019 12:55 PM
LXer: Python explosion blamed on pandas LXer Syndicated Linux News 0 09-16-2017 07:07 PM
[SOLVED] sum a value of column based on the value of other columns umix Linux - Newbie 5 04-11-2013 03:15 PM
difference between value *value and value * value PoleStar Linux - Newbie 1 11-26-2010 03:37 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 02:07 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration