LinuxQuestions.org
Old 02-02-2024, 04:53 PM   #1
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Rep: Reputation: 0
Most relevant programs for a Data Engineer


Hello,

Nice to be here! I want to start a career as a Data Engineer. I have beginner skills in PostgreSQL and pandas. I am interested in scraping data, for example prices from online shops. I want to know about scraping frameworks: is Scrapy still in use? What does scraping look like in a real job? Do I collect data from Scrapy into JSON, then build tables in pandas, and after that load them into PostgreSQL? Or should I save the scraped data (from Scrapy) straight into PostgreSQL tables and then use pandas for data manipulation? Or is there a better approach than Scrapy for grabbing data from the internet?

Thank you,
Fluer
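
For illustration, the first workflow asked about (scraper output to JSON, clean in Python, then load into a database) could look roughly like the sketch below. It is only a sketch under stated assumptions: it uses the standard library, with `sqlite3` standing in for PostgreSQL and plain tuples standing in for a pandas DataFrame, and the field names `name` and `price`, the `prices` table, and the sample records are all made up.

```python
import json
import sqlite3

# Hypothetical scraper output in JSON Lines form (one JSON object per line),
# similar to what a Scrapy feed export could produce. The records are made up.
RAW_JSONL = """\
{"name": "Widget", "price": "$19.99"}
{"name": "Gadget", "price": "5,00 EUR"}
{"name": "Broken item", "price": null}
"""


def parse_price(text):
    """Return the price as a float, or None if it cannot be parsed."""
    if not text:
        return None
    # Keep digits and separators only, and treat a comma as a decimal point.
    digits = "".join(ch for ch in text if ch.isdigit() or ch in ".,")
    try:
        return float(digits.replace(",", "."))
    except ValueError:
        return None


def load_items(raw_jsonl):
    """Parse JSON Lines and drop records without a usable price."""
    rows = []
    for line in raw_jsonl.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        price = parse_price(item.get("price"))
        if price is not None:
            rows.append((item["name"], price))
    return rows


def store_items(conn, rows):
    """Create the table if needed and insert the cleaned rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS prices (name TEXT, price REAL)")
    conn.executemany("INSERT INTO prices (name, price) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
store_items(conn, load_items(RAW_JSONL))
stored = conn.execute("SELECT name, price FROM prices ORDER BY name").fetchall()
print(stored)
```

The second workflow (raw data straight into the database, cleaning afterwards) would move the `parse_price` step to after the insert; which order fits better depends on whether the raw records need to be kept.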
 
Old 02-03-2024, 10:19 AM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by Fluer
Hello,
Nice to be here! I want to start a career as a Data Engineer. I have beginner skills in PostgreSQL and pandas. I am interested in scraping data, for example prices from online shops. I want to know about scraping frameworks: is Scrapy still in use? What does scraping look like in a real job? Do I collect data from Scrapy into JSON, then build tables in pandas, and after that load them into PostgreSQL? Or should I save the scraped data (from Scrapy) straight into PostgreSQL tables and then use pandas for data manipulation? Or is there a better approach than Scrapy for grabbing data from the internet?
This is a question that has no real answer. What is 'in use' from one shop to another varies wildly, so there isn't a "you should use this and get a job" answer. You need to have depth of knowledge/skills, not just know how to use one tool and one database.

What if they want to use Oracle as a database?? XML instead of JSON??? Or have their OWN scraping utility they wrote in house? Develop your skills overall.
 
1 members found this post helpful.
Old 02-03-2024, 10:45 AM   #3
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,307

Rep: Reputation: 2324
I think it's fair to say that for any job these days you need
  • A CS degree
  • Good relevant experience, unless you're paid as a trainee.

It's also totally the wrong time, because everyone is laying off folks with good experience. You'll need several languages, not just one. There's nobody making money by using Scrapy to scrape sites. Do you even have good Python 3?

A scraper is put together in a day or two using Python modules if it's needed. People with very good CVs are out there also looking for work. What can you do that they can't?
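
To give a rough idea of what "a scraper from Python modules" can mean, here is a minimal sketch using only the standard library's `html.parser`. The HTML snippet, the `name` and `price` class names, and the `PriceParser` class are made up for illustration; a real page would need its own selectors and an HTTP fetch step in front.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded shop page; a real scraper would
# fetch it over HTTP (e.g. with urllib.request or requests) first.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">19.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">5.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect (name, price) pairs from spans with class "name" or "price"."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._field = None  # class of the span currently being read
        self._name = None   # last product name seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "price":
            self.items.append((self._name, float(data)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.items)
```

Frameworks such as Scrapy add scheduling, retries, and throttling on top of this kind of parsing, which is where most of the extra effort goes on larger crawls.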
 
1 members found this post helpful.
Old 02-03-2024, 10:59 AM   #4
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
TB0ne, thank you! I enjoy working with parsers. I know it used to be a respectable profession in the past. However, nowadays, I frequently hear on TV that everyone should know how to work with neural networks, AI, and machine learning. What if the role of a Data Engineer (due to ChatGPT or something else) fades away as a profession, or am I worrying for nothing?
 
Old 02-03-2024, 11:27 AM   #5
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
Business_kid, thank you!

I know Python, but I don't think it's good enough. I agree with you because I've noticed that Scrapy is not listed in most Data Engineer job postings. This uncertainty makes me question its necessity. Currently, I am employed as a site administrator (working with text, HTML, CSS, Photoshop, and CMS). However, my aspiration is to work in the field of Big Data analytics. I understand that I need time to learn and practice (I am willing to work for free initially).

Can you advise me on the programming languages and Python modules I should learn and practice to align more closely with my dream profession?
 
Old 02-03-2024, 11:59 AM   #6
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by Fluer
TB0ne, thank you! I enjoy working with parsers. I know it used to be a respectable profession in the past. However, nowadays, I frequently hear on TV that everyone should know how to work with neural networks, AI, and machine learning. What if the role of a Data Engineer (due to ChatGPT or something else) fades away as a profession, or am I worrying for nothing?
Again, there is no 'right' answer...there are ALWAYS going to be new things, as well as older things. You need DEPTH of knowledge of both, and knowing how to use the tools at your disposal, period. Saying "Data engineer" is absolutely meaningless, since that's an arbitrary job title that can mean a million different things. What you have to do at one company will be completely different at another, for the same role.
Quote:
Originally Posted by Fluer
I know Python, but I don't think it's good enough. I agree with you because I've noticed that Scrapy is not listed in most Data Engineer job postings. This uncertainty makes me question its necessity. Currently, I am employed as a site administrator (working with text, HTML, CSS, Photoshop, and CMS). However, my aspiration is to work in the field of Big Data analytics. I understand that I need time to learn and practice (I am willing to work for free initially).

Can you advise me on the programming languages and Python modules I should learn and practice to align more closely with my dream profession?
If you can't take the time to learn what you need to and do your own research about a job that YOU WANT, how do you expect to GET that job??? Again, "data engineer" is a meaningless title...find some jobs that you want, and look at what you need to know to get that job....that is what you need to learn. Unless you can show initiative on your own to do your own research and learn, you won't get much in the way of jobs anywhere. Again, there aren't any magic things that say, "if you know X, Y, and Z, you can get a job as this!!"...just doesn't work that way.
 
1 members found this post helpful.
Old 02-03-2024, 03:26 PM   #7
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
Quote:
...find some jobs that you want, and look at what you need to know to get that job...
TB0ne, thank you! You are right. I googled vacancies again and decided to learn Apache Airflow, requests and SQLAlchemy first. I found an example of code. Can it really be used in a real project?

Code:
from datetime import datetime, timedelta

import pandas as pd
import requests
from airflow import DAG
# Airflow 2.x import path; airflow.operators.python_operator is deprecated
from airflow.operators.python import PythonOperator
from sqlalchemy import create_engine

# Default arguments applied to every task in the DAG
default_args = {
    'owner': 'data_engineer',
    'depends_on_past': False,
    'start_date': datetime(2024, 2, 3),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'data_pipeline_example',
    default_args=default_args,
    description='A simple data pipeline example',
    schedule_interval=timedelta(days=1),  # Run daily (Airflow 2.4+ prefers schedule=)
    catchup=False,  # Do not backfill runs between start_date and today
)


# Define tasks
def fetch_data():
    """Download JSON from the API; the return value is pushed to XCom."""
    api_url = 'https://example-api.com/data'  # placeholder URL
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # Fail the task (and trigger a retry) on HTTP errors
    return response.json()


def clean_and_store_data(**kwargs):
    """Pull the fetched data from XCom, clean it, and write it to PostgreSQL."""
    ti = kwargs['ti']
    data = ti.xcom_pull(task_ids='fetch_data_task')

    # Perform data cleaning (modify as needed)
    cleaned_data = pd.DataFrame(data)

    # Store data in PostgreSQL (replace connection string; keep real
    # credentials out of the source file)
    engine = create_engine('postgresql://username:password@localhost:5432/database')
    # if_exists='replace' drops and recreates the table on every run
    cleaned_data.to_sql('your_table', engine, if_exists='replace', index=False)


# Set up tasks in the DAG
fetch_data_task = PythonOperator(
    task_id='fetch_data_task',
    python_callable=fetch_data,
    dag=dag,
)

# Airflow 2 passes the task context automatically, so provide_context
# is no longer needed
clean_and_store_data_task = PythonOperator(
    task_id='clean_and_store_data_task',
    python_callable=clean_and_store_data,
    dag=dag,
)

# Define task dependencies
fetch_data_task >> clean_and_store_data_task
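
One detail in the example above that may matter for a price-tracking use case: `if_exists='replace'` recreates the table on each run, so a daily pipeline would keep only the latest snapshot. A possible alternative is an upsert keyed on product and day. The sketch below is an assumption-laden illustration, not part of the original example: the `prices` table and its columns are hypothetical, and `sqlite3` stands in for PostgreSQL, which accepts a similar `INSERT ... ON CONFLICT ... DO UPDATE` statement.

```python
import sqlite3

# Hypothetical table layout: one row per product per day.
UPSERT_SQL = """
INSERT INTO prices (product, day, price) VALUES (?, ?, ?)
ON CONFLICT (product, day) DO UPDATE SET price = excluded.price
"""


def store_prices(conn, rows):
    """Insert rows, overwriting only the row for the same product and day."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "product TEXT, day TEXT, price REAL, PRIMARY KEY (product, day))"
    )
    conn.executemany(UPSERT_SQL, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
store_prices(conn, [("Widget", "2024-02-03", 19.99)])
# A re-run for the same day updates the row; a new day adds history.
store_prices(conn, [("Widget", "2024-02-03", 18.49), ("Widget", "2024-02-04", 18.49)])
history = conn.execute("SELECT day, price FROM prices ORDER BY day").fetchall()
print(history)
```

With this shape, retrying a failed run does not duplicate rows, and earlier days stay in the table.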
 
Old 02-04-2024, 04:53 AM   #8
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,307

Rep: Reputation: 2324
I see a catch-22 that you are in danger of walking into.

In my business (Electronic Hardware) likewise, there were numerous 'latest things' during my career. What that means is that experience in them translates quickly into a job. But by the time you do a course on them, the latest thing may have changed, or the angle may have changed, because things never stay static.

Personally I think the world is waiting for 'underwear for AI' to be developed, conceived or programmed. AI with the wrong prompting can look very silly indeed if not protected. It's much like "The Emperor's new suit of clothes" in the fairy tale, as has been shown more than once.
https://www.youtube.com/watch?v=G55Oq3oBls0

Last edited by business_kid; 02-04-2024 at 04:54 AM.
 
1 members found this post helpful.
Old 02-04-2024, 10:20 AM   #9
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by Fluer
TB0ne, thank you! You are right. I googled vacancies again and decided to learn Apache Airflow, requests and SQLAlchemy first. I found an example of code. Can it really be used in a real project?
This will be the third time you've been told this: there is NO WAY TO KNOW. How, exactly, is anyone here (or ANYWHERE) going to know that yes, this particular piece of sample code will 100% be used in a 'real project' at some unspecified company for some unspecified reason???

Pay attention: DEVELOP YOUR SKILLS, all of them. Learn to think and troubleshoot, pay attention to the answers you get when you ask questions and think about how they apply.
 
1 members found this post helpful.
Old 02-04-2024, 06:44 PM   #10
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
Quote:
Personally I think the world is waiting for 'underwear for AI' to be developed, conceived or programmed.
Yes, "prompt engineering" is a new and well-paid profession. When I ask ChatGPT to provide hacker tools, it says, "I can't assist." But when I ask about pentester tools, it tells me all about cracking. It would be awesome if AI had "underwear" (an API or something) for every task.
 
Old 02-04-2024, 06:57 PM   #11
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by TB0ne

Pay attention: DEVELOP YOUR SKILLS, all of them. Learn to think and troubleshoot, pay attention to the answers you get when you ask questions and think about how they apply.
I am sorry. The marketing department (where I work as a site administrator) is starting a new ad campaign. I have decided to compile customers' opinions about our product before and after the new ads: collect the information, clean and analyze it, and produce visualizations and recommendations for future sales. I will play around with some tools and see how they work in several cases. I hope I will learn how to think and troubleshoot this way.
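
As a rough idea of the analysis step in that plan, a before/after comparison could start as small as the sketch below. Everything in it is hypothetical: the campaign start date, the 1-5 rating scale, and the sample reviews are made up purely to show the shape of the computation.

```python
from datetime import date
from statistics import mean

CAMPAIGN_START = date(2024, 2, 5)  # hypothetical launch date

# Made-up (date, rating 1-5) pairs standing in for collected customer opinions.
reviews = [
    (date(2024, 1, 20), 3),
    (date(2024, 1, 28), 4),
    (date(2024, 2, 6), 4),
    (date(2024, 2, 10), 5),
]


def split_by_campaign(rows, start):
    """Split ratings into those dated before the start and those on or after it."""
    before = [rating for day, rating in rows if day < start]
    after = [rating for day, rating in rows if day >= start]
    return before, after


before, after = split_by_campaign(reviews, CAMPAIGN_START)
summary = {"before": mean(before), "after": mean(after)}
print(summary)
```

With real data the sample sizes and the collection method matter more than the arithmetic, so a difference in means on its own should be read cautiously.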
 
Old 02-06-2024, 03:00 PM   #12
Fluer
LQ Newbie
 
Registered: Feb 2024
Posts: 7

Original Poster
Rep: Reputation: 0
Thank you very much for the answers! I am starting to learn things for my new profession with great help from your advice.
 
  


All times are GMT -5. The time now is 04:32 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration