Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
|
|
05-23-2024, 09:59 AM
|
#1
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Rep:
|
Selecting columns containing a value (Python, pandas)
Hi,
I am a beginner with using Python.
Task:
From a few thousand excel spreadsheets (.xlsx) in the current directory, I would like to save each sheet separately, but only columns somewhere containing certain strings (e.g. 'Team', 'Total').
I have the code so far to save the sheets, but I am lost in guides and documentation as to select the columns with matching data. The columns have no headers.
General aim:
...To impress the boss! I had wanted to learn python/R/Excel for analysing data, but if this goes well, perhaps they might pay for a few exams in Excel. I am working in admin at the moment, for which I am grateful, but I would like to do something more analytical. This task came up just as I have been starting to learn Python, so some of this is still ahead of me at the moment.
Code:
import pandas as pd
import os
import openpyxl
# print("Copying sheets from multiple files to one file")
cwd = os.path.abspath('')
files = os.listdir(cwd)
for file in files: # loop through Excel files
if file.endswith('.xls') or file.endswith('.xlsx'):
excel_file = pd.ExcelFile(file)
sheets = excel_file.sheet_names
for sheet in sheets: # loop through sheets inside an Excel file
print (file, sheet)
df = excel_file.parse(sheet_name = sheet)
df.dropna(inplace=True)
# df[df.dropna().str.contains('Total|Team', case=False)] # This didn't work ('DataFrame' object has no attribute 'str'); not sure how to fix or to use different approach)
# df = pd.Series([df.dropna()]).str.contains("Team") # This filtered matching rows, not matching columns!
df.to_csv(f"{file}-{sheet}.csv", index=False)
Example input: (desired columns shown in red, within which search strings 'Team' and ' Total' shown in bold (case insensitive)
Spreadsheet 1:
Code:
Random descriptions and other text
More text
Action Name Dpt 1 2 3 4 5 Total
Break J Nursing N/A N/A 33 33 10 76
Review N/A All N/A N/A 10 12 5 27
COA All Study-team 200 10 300 20 0 530
Assmt General Admin-team 20 30 N/A 30 N/A 50
Spreadsheet 2:
Code:
Random text about the project
More random text Text More text
Action Name Sponsor Responsible Process 1 2 3 4 5 6 7 8 Total Comments Resources
Set-up J PPD Nursing Added N/A N/A 33 33 10 76 40 40 232 Outstanding None None
Costing J PPD Study-team Added 40 N/A N/A 10 12 5 27 32 131
Contact All Realto Study-team Pending 200 10 300 20 0 50 50 20 650
Monitoring General MOD Admin-team Pending 20 30 N/A 30 N/A 50 50 20 200 TBC
Monitoring2 MK PPD Admin-team Pending 60 N/A N/A N/A N/A N/A N/A N/A 60
Resources J Realto Pending 60 60
Closedown MK Realto Pending N/A N/A N/A N/A N/A N/A N/A 300 300 TBC Inhouse
Spreadsheet 3:
Code:
Line of text
Line of text
Line of text Some text
Action Specf Team Location Cycle1 Cycle2 Cycle3 TotalCost Startdate
Break V2 Admin GBHD 40 60 40 140 2024-12-14
Review V3 AS Old 190 24 190 404 TBC
COA V3 TBC Main 194 206 54 454
Assmt V2 None General
Desired output:
Spreadsheet 1:
Code:
Dpt Total
Nursing 76
All 27
Study-team 530
Admin-team 50
Spreadsheet 2:
Code:
More text <--- This text not really desired, but I can remove these lines later
Responsible Total
Nursing 232
Study-team 131
Study-team 650
Admin-team 200
Admin-team 60
60
600
Spreadsheet 3:
Code:
Some text <--- This text not really desired, but I can remove these lines later
Team TotalCost
Admin 140
AS 404
TBC 454
None
Thank you
Last edited by lxs602; 05-25-2024 at 07:40 AM.
Reason: Added more examples
|
|
|
05-24-2024, 02:00 AM
|
#2
|
Member
Registered: Jan 2008
Location: P.E.I. Canada
Posts: 32
Rep:
|
Have you heard of AWK or GAWK.
Code:
awk '{printf "%-17s %s\n", $3, $9}' datafile
Code:
Dpt Total
Nursing 76
All 27
Study-team 530
Admin-team 50
Ooops! You are working with spreadsheets. My bad!
Last edited by gerard4143; 05-24-2024 at 02:11 AM.
|
|
|
05-24-2024, 02:39 AM
|
#3
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702
|
|
|
1 members found this post helpful.
|
05-24-2024, 03:47 AM
|
#4
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,506
|
Of if you focus on openpyxl instead of pandas:
Code:
#!/usr/bin/python3
import os
import openpyxl # https://openpyxl.readthedocs.io/en/stable/
import csv # https://python.readthedocs.io/en/latest/library/csv.html
cwd = os.path.abspath('')
files = os.listdir(cwd)
for file in files: # loop through Excel files
if file.endswith('.xls') or file.endswith('.xlsx'):
excel_file = openpyxl.load_workbook(file)
sheets = excel_file.sheetnames
for sheet in sheets:
print (f'Processing {sheet} from {file}')
worksheet = excel_file[sheet]
csvfile = csv.writer(open(file+"."+sheet+".csv",
'w',
newline=''),
delimiter='\t',
)
# an exceptionally clumsy way:
for row in worksheet:
newrow = list()
newrow.append(row[2])
newrow.append(row[8])
csvfile.writerow([cell.value for cell in newrow])
Or you could "just" convert to CSV and use a recent version of AWK as it recently gained support for reading CSV files.
|
|
|
05-24-2024, 12:10 PM
|
#5
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Quote:
Originally Posted by gerard4143
Have you heard of AWK or GAWK.
Code:
awk '{printf "%-17s %s\n", $3, $9}' datafile
Code:
Dpt Total
Nursing 76
All 27
Study-team 530
Admin-team 50
Ooops! You are working with spreadsheets. My bad!
|
I could convert them all to csv, and search for text in columns using something other than python.
I can't use column numbers, as there are thousands of spreadsheets, and the search terms are all in different columns.
Last edited by lxs602; 05-24-2024 at 12:40 PM.
|
|
|
05-24-2024, 12:42 PM
|
#6
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Quote:
Originally Posted by Turbocapitalist
Of if you focus on openpyxl instead of pandas:
Code:
#!/usr/bin/python3
import os
import openpyxl # https://openpyxl.readthedocs.io/en/stable/
import csv # https://python.readthedocs.io/en/latest/library/csv.html
cwd = os.path.abspath('')
files = os.listdir(cwd)
for file in files: # loop through Excel files
if file.endswith('.xls') or file.endswith('.xlsx'):
excel_file = openpyxl.load_workbook(file)
sheets = excel_file.sheetnames
for sheet in sheets:
print (f'Processing {sheet} from {file}')
worksheet = excel_file[sheet]
csvfile = csv.writer(open(file+"."+sheet+".csv",
'w',
newline=''),
delimiter='\t',
)
# an exceptionally clumsy way:
for row in worksheet:
newrow = list()
newrow.append(row[2])
newrow.append(row[8])
csvfile.writerow([cell.value for cell in newrow])
Or you could "just" convert to CSV and use a recent version of AWK as it recently gained support for reading CSV files.
|
How would this search for terms in the text of the columns, and then select only those columns?
|
|
|
05-24-2024, 12:52 PM
|
#7
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Most of the information I find seems to be either to:
- search for strings in rows, rather than search for strings in columns.
- select columns by column name, and not by text in the body of the column. My spreadsheets have no column names.
I added two more example spreadsheets to the first post.
It seems more complex than I thought.
Last edited by lxs602; 05-25-2024 at 07:03 AM.
|
|
|
05-24-2024, 12:53 PM
|
#8
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,506
|
Quote:
Originally Posted by lxs602
How would this search for terms in the text of the columns, and then select only those columns?
|
If the first row (or some other row) of each table has the terms you are looking for, have the script figure out which column numbers to use before iterating through the rest of the row for that table.
|
|
1 members found this post helpful.
|
05-24-2024, 08:42 PM
|
#9
|
Moderator
Registered: Aug 2002
Posts: 26,128
|
Can you determine the type of spreadsheet from the name or from any particular keyword inside the spreadsheet? If not then maybe the spreadsheet's metadata might contain some identifying information that could determine the type. Otherwise I am not sure how you will determine which columns. While you could search for the desired keyword the column for desired number i.e total or totalcost is not the same which makes it a bit difficult.
|
|
|
05-25-2024, 02:58 AM
|
#10
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702
|
what I see is still the same
You want two columns. This is easily made by: df(["one", "two"]). The only problem I don't know how do you want to pick the name of those columns.
And how do you want to identify those "random" lines.
|
|
|
05-25-2024, 06:25 AM
|
#11
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Quote:
Originally Posted by pan64
what I see is still the same
You want two columns. This is easily made by: df(["one", "two"]). The only problem I don't know how do you want to pick the name of those columns.
And how do you want to identify those "random" lines.
|
Somewhere in the text of the columns (but not in the column names) will be keywords, for example, 'team', or 'cost', or 'lab', etc.
- A column with a matching keyword somewhere in the body of their text should be kept, including all its rows.
- Columns have no names - the data is from Excel spreadsheets
The task is that these are messy Excel spreadsheets (all originally based on a template), which need tidying up. This is to sum cost per team for each project (each spreadsheet), and other basic analysis.
Last edited by lxs602; 05-25-2024 at 07:05 AM.
|
|
|
05-25-2024, 07:07 AM
|
#12
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Quote:
Originally Posted by michaelk
Can you determine the type of spreadsheet from the name or from any particular keyword inside the spreadsheet? If not then maybe the spreadsheet's metadata might contain some identifying information that could determine the type. Otherwise I am not sure how you will determine which columns. While you could search for the desired keyword the column for desired number i.e total or totalcost is not the same which makes it a bit difficult.
|
I think Turbocapitalist is right - I need to think about searching rows to find columns with a keyword, and then pick columns based on that for each spreadsheet.
|
|
|
05-25-2024, 07:14 AM
|
#13
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702
|
Your specification does not fit, there is no "cost" in your examples, you cannot use it to specify anything. Otherwise it means at least 3 steps:
1. remove unnecessary lines
2. identify necessary columns
3. construct the result
I would split this task into parts and solve them one by one. The question is not the last step, but the previous ones.
|
|
|
05-25-2024, 07:38 AM
|
#14
|
Member
Registered: Oct 2018
Distribution: Ubuntu 24.04
Posts: 50
Original Poster
Rep:
|
Quote:
Originally Posted by pan64
Your specification does not fit, there is no "cost" in your examples, you cannot use it to specify anything. Otherwise it means at least 3 steps:
|
Apologies for the typo. Thanks for pointing that out. I have corrected the first post.
Last edited by lxs602; 05-25-2024 at 07:39 AM.
|
|
|
05-25-2024, 09:41 AM
|
#15
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702
|
as an extremely dumb solution you can just pick all the columns you need (Dpt, TotalCost, Responsible, Team, Total), some of them will be just empty. If you can cut everything before the first usable line.
|
|
|
All times are GMT -5. The time now is 02:07 PM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|