The Life of a Data Engineer



Date: 09.06.2018

Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to – and from – these huge “pools” of filtered information, data scientists can pull relevant data sets for their analysis.

In his/her role as a hardcore builder, a data engineer may be required to:


  • Design, construct, install, test and maintain highly scalable data management systems

  • Ensure systems meet business requirements and industry practices

  • Build high-performance algorithms, prototypes, predictive models and proofs of concept

  • Research opportunities for data acquisition and new uses for existing data

  • Develop data set processes for data modeling, mining and production

  • Integrate new data management technologies and software engineering tools into existing structures

  • Create custom software components (e.g. specialized UDFs) and analytics applications

  • Employ a variety of languages and tools (e.g. scripting languages) to marry systems together

  • Install and update disaster recovery procedures

  • Recommend ways to improve data reliability, efficiency and quality

  • Collaborate with data architects, modelers and IT team members on project goals

Data engineers may work closely with data architects (to determine what data management systems are appropriate) and data scientists (to determine which data are needed for analysis). They often wrestle with problems associated with database integration and messy, unstructured data sets. Their ultimate aim is to provide clean, usable data to whoever may require it.

An Interview with a Real Data Engineer


Q: HOW DO DATA ENGINEERS HELP URTHECAST SUCCEED?

A: One of the primary facets of Urthecast’s business is to provide our data for others to consume in a variety of ways, whether their focus involves science, government, education, or business. In essence, we provide data for others to analyze and build upon. Thus, the impact of our data engineers is extremely important. Our success lies in both the quality and the quantity of what we can offer. Because of this, we have data engineers working in a variety of capacities along the entire data pipeline – from working with the raw data, perfecting it with geospatial raster processes (georeferencing, orthorectification, and mosaicing), all the way to building APIs for developer access.

In addition, as with most companies, we manage data beyond our core product. For example, we analyze log files and customer-use patterns. All companies benefit from this knowledge, which turns into useful business metrics.

Q: WHICH PROGRAMMING LANGUAGES DO YOU MOST FREQUENTLY USE?

A: The skills and tools that are utilized on the job are highly dependent on which part of the data pipeline you focus on. For myself, I’m at the tail end of the pipeline building APIs for data consumption, integrating external datasets, and analyzing how our data is used to further improve our end product.

With APIs, I really feel web languages are sufficiently robust, so it’s not as important which one you choose as long as it is embraced as a common language amongst your team. Our environment relies heavily on both PHP and Python. Almost all of my code relating to data ingestion from other providers is written in Python. It is uncomplicated and robust and can talk to any datastore whether it’s RDBMS or NoSQL. Lastly, for data analysis, I use Big Data technologies, such as Spark, to forecast and recommend improvements based on how the data is consumed.

Q: HOW IS THE ROLE OF DATA ENGINEERS CHANGING/EVOLVING?

A: We are in a data revolution, and these are exciting times. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. Beyond the ubiquitous spreadsheet, graduating from RDBMS (which will always have a place in the data stack), we now work with NoSQL and Big Data technologies.

As the tools and processes become more complex, and because raw data is always dirty, data engineers will always have a place in the workforce. I do think the tools we use will become more refined and more powerful, but I don’t see raw data ever arriving clean. Also, I think we will have new, as-yet-unknown to us, data models that will keep things fresh and keep data engineers always learning.

Q: WHAT’S THE DIFFERENCE BETWEEN DATA ENGINEERS AND DATA SCIENTISTS?

A: Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data. All data has a story to tell.

The communication between a data engineer and a data scientist is vital. Typically, data is not just thrown in a database awaiting consumption. It needs to be optimized to the use case of the data scientist. Having a clear understanding of how this handshake occurs is important in reducing the human error component of the data pipeline.

Personally, I’m a fan of providing data access via an API. This allows scientists to focus on what they can do with the data rather than how to access the data. Not everyone understands SQL, and not everyone writes good SQL. PDFs and spreadsheets have their place in the board room. With a well-written RESTful API, the data engineer is able to provide the data scientist with either exactly what they want or the means to access raw data and then build their final product.

Lastly, I’ll just say that it’s important for data scientists to be appreciative of an engineer’s work. Last year, the NY Times wrote that 50 to 80 percent of a data scientist’s job is cleaning data. That is not the case once you have a team of data engineers on board, allowing the data scientist to focus on analytics.

Q: WHAT KIND OF PERSON MAKES THE BEST DATA ENGINEER?

A: I feel a data engineer should have the following traits:



  • Mechanical tendencies. A curiosity to know how things work and how to make them better.

  • Patience. Nothing will work the first time; there are just too many moving parts.

  • Humility. Data engineers are the wizards of Oz. Ultimately, you are in a support role; you help build the underlying infrastructure. Be proud of your work and know that others may get more of the limelight because of your efforts.

  • Focus. Designing data is one of my favorite aspects of my work, but it tends to be a smaller percentage of my day. A data engineer should want to be in the weeds, understanding the intricacies of how and why a data pipeline works as it does.

Q: WHAT ADVICE WOULD YOU OFFER DATA ENGINEERING STUDENTS?

A: Of course it’s important to be fluent in the languages and tools that will help you get hired. But more important, I believe, is to understand what the tools are helping you to accomplish. Languages come and go, so it’s better to gain a full understanding of the concepts behind building a robust pipeline.

Also, be extremely comfortable at the command line. Text files still reign supreme, whether it’s your own code, csv/json/xml data, or log files.
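
To illustrate the point about text formats, here is a minimal sketch, not taken from the interview, that reads the same two records from CSV, JSON and XML text using only the Python standard library. The sample data, the field names (`id`, `city`) and the function names are hypothetical and chosen purely for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two records in three text formats.
CSV_TEXT = "id,city\n1,Vancouver\n2,Seattle\n"
JSON_TEXT = '[{"id": 1, "city": "Vancouver"}, {"id": 2, "city": "Seattle"}]'
XML_TEXT = "<rows><row id='1' city='Vancouver'/><row id='2' city='Seattle'/></rows>"


def from_csv(text):
    """Parse CSV text into a list of dicts, converting id to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "city": r["city"]} for r in reader]


def from_json(text):
    """Parse a JSON array of objects into the same list-of-dicts shape."""
    return [{"id": int(r["id"]), "city": r["city"]} for r in json.loads(text)]


def from_xml(text):
    """Parse <row id=... city=.../> elements into the same shape."""
    root = ET.fromstring(text)
    return [{"id": int(e.get("id")), "city": e.get("city")} for e in root.iter("row")]


# All three parsers yield identical records once types are normalized.
records = from_csv(CSV_TEXT)
```

Normalizing every format into one in-memory shape (here, a list of dicts) is one common way to keep the rest of a pipeline independent of how the data arrived.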

Lastly, find a community and get involved! Check meetup.com for something in your area or local universities, which may have study groups that you can join. Keep a lookout for hackathons – they always need data specialists.


Data Engineer Salaries


As you might expect, Silicon Valley is the center of well-paying IT jobs. According to PayScale, the median pay in 2015 for a Data Scientist/Engineer in San Francisco was $117,388 (25% above the national average). Seattle was in second place, with a median pay of $105,340 (12% above).

Data Engineer


Glassdoor
Average Salary (2015): $95,936 per year
Minimum: $66,000
Maximum: $117,000

Data Scientist/Engineer


PayScale
Median Salary (2015): $91,782 per year
Total Pay Range: $58,773 – $143,419

Senior Data Engineer


Glassdoor
Average Salary (2015): $124,338 per year
Minimum: $105,000
Maximum: $147,000

Big Data Engineer


Robert Half Technology 2015 Salary Guide
Average Salary (2014): $110,250 – $152,750
Average Salary (2015): $119,250 – $168,250

Data Engineer Qualifications

What Kind of Degree Will I Need?


You will need a bachelor’s degree in computer science, software/computer engineering, applied math, physics, statistics or a related field and a lot of real-world skills to qualify for most entry-level positions.

Is a master’s required? It depends on the job. Some employers are more than willing to accept relevant work experience and proof of technical expertise in lieu of a higher degree.


What Kind of Skills Will I Need?

Technical Skills


  • Statistical analysis and modeling

  • Database architectures

  • Hadoop-based technologies (e.g. MapReduce, Hive and Pig)

  • SQL-based technologies (e.g. PostgreSQL and MySQL)

  • NoSQL technologies (e.g. Cassandra and MongoDB)

  • Data modeling tools (e.g. ERWin, Enterprise Architect and Visio)

  • Python, C/C++, Java, Perl

  • MATLAB, SAS, R

  • Data warehousing solutions

  • Predictive modeling, NLP and text analysis

  • Machine learning

  • Data mining

  • UNIX, Linux, Solaris and MS Windows

Since new data management technologies are appearing every day, this list is subject to change.

Business Skills


  • Creative Problem-Solving: Approaching data organization challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.

  • Effective Collaboration: Carefully listening to management, data scientists and data architects to establish their needs.

  • Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve data management problems.

  • Industry Knowledge: Understanding the way your chosen industry functions and how data can be collected, analyzed and utilized; maintaining flexibility in the face of big data developments.

What About Certifications?


If you’re interested in buffing up specific skills, you’ll find a lot of vendor-specific certifications (e.g. Oracle, Microsoft, IBM, Cloudera, etc.). To determine which ones are worth your investment, ask your mentors for advice, examine recent job descriptions and browse articles like Tom’s IT Pro “Best Of” certification lists for ideas.

Certified Data Management Professional (CDMP)


Developed by the Institute for the Certification of Computing Professionals (ICCP), the CDMP is a solid, all-round credential for general database professionals. Many employers will recognize the acronym on your résumé.

The CDMP is offered at two levels – Practitioner or Mastery – and awarded to candidates who provide evidence of education, experience and passing results on the CDMP’s professional knowledge exam. Proof of continuing education and professional activity is required to re-certify.


Jobs Similar to Data Engineer


As we’ve mentioned, a data engineer in a large company may work closely with a Data Architect. Like their counterparts in the physical world, architects are often immersed in the planning stages of infrastructure projects, while data engineers are deeply involved in the actual construction process.

It’s important to note that data engineers are typically not analysts. Their job is to make data available to others; they don’t use it to discover patterns and trends that affect business decisions. If you’re interested in that kind of role, you may wish to consider being a:



  • Data Analyst

  • Data Scientist

Having said that, some smaller companies may combine the roles of scientist and engineer into one.

Data Engineer Job Outlook


It looks like data engineers are finally getting their due. As Alex Woodie of Datanami points out in his 2014 article, “figures from job posting websites show much higher demand for data engineers than for data scientists.” Faced with a tsunami of big data, companies are eager for experts who can “ensure that data pipelines are scalable, repeatable, and secure, and can serve multiple constituents in the enterprise.”

Job opportunities will be best for candidates who aren’t afraid of change! The evolution of Hadoop – which is increasingly being used as an enterprise data hub – rapid advances in processing power for predictive analytics and a general shift towards the Cloud are making a data engineer’s life more complicated than ever before.

Even well-established methods of data management are mutating. For instance, instead of carefully designing a data model before entering data into a system, some engineers are now dumping their information into big data lakes (e.g. a Hadoop repository), where organizations can access and analyze data regardless of schema.
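
The “regardless of schema” idea is often called schema-on-read. As a rough sketch under stated assumptions (the records, field names and the `read_with_schema` helper are all hypothetical, and a real data lake would sit on a system such as Hadoop rather than an in-memory list), raw records are stored untouched and a schema is applied only when someone reads them:

```python
import json

# Hypothetical raw records, stored as-is with no shared schema
# (one JSON object per line, fields and types vary between records).
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "5", "country": "CA"}',
    '{"user": "c"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only the requested fields, casts each present value with the
    given type, and fills missing fields with None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# One consumer's view of the data; another consumer could pass a different schema.
rows = list(read_with_schema(RAW_LINES, {"user": str, "clicks": int}))
total_clicks = sum(r["clicks"] or 0 for r in rows)
```

The trade-off is that the cleaning work does not disappear; it moves from load time to read time, and each consumer has to decide how to handle missing or mistyped fields.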

Although these changes can sometimes make for a frustrating work day, it’s an incredibly exciting time to be a builder. If you love playing with new tools and can think outside the relational database box, you’ll be in a prime position to help companies adapt to the 21st century.


The key to building a data science portfolio that will get you a job


Vik Paruchuri 12 AUG 2016 in tutorials, python, portfolio, and project

This is the fourth post in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post.

In the past few posts in this series, we’ve talked about how to build a data science project that tells a story, how to build an end-to-end machine learning project, and how to set up a data science blog. In this post, we’ll take a step back, and focus on your portfolio at a high level. We’ll discuss what skills employers want to see a candidate demonstrate, and how to build a portfolio that demonstrates all of those skills effectively. We’ll include examples of what each project in your portfolio should look like, and give you suggestions on how to get started.



After reading this post, you should have a good understanding of why you should build a data science portfolio, and how to go about doing it.

What employers look for


When employers hire, they’re looking for someone who can add value to their business. Often, this means someone who has skills that can generate revenue and opportunities for the business. As a data scientist, you add value to a business in one of 4 main ways:

  • Extracting insights from raw data, and presenting those insights to others.

    • An example would be analyzing ad click rates, and discovering that it’s much more cost effective to advertise to people who are 18 to 21 than to people who are 21 to 25 – this adds business value by allowing the business to optimize its ad spend.

  • Building systems that offer direct value to the customer.

    • An example would be a data scientist at Facebook optimizing the news feed to show better results to users – this generates direct revenue for Facebook because more news feed engagement means more ad engagement.

  • Building systems that offer direct value to others in the organization.

    • An example would be building a script that automatically aggregates data from 3 databases and generates a clean dataset for others to analyze – this adds value by making it faster for others to do their work.

  • Sharing your expertise with others in the organization.

    • An example is chatting with a product manager about how to build a feature that requires machine learning algorithms – this adds value by preventing unrealistic timelines, or a semi-functional product.
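
The third example in the list above, a script that aggregates data from 3 databases into a clean dataset, might look roughly like the following sketch. It is only an illustration: the three sources are in-memory SQLite databases, and the `events` table, its columns and the cleaning rules (skip incomplete rows, normalize email case and whitespace) are assumptions made up for this example.

```python
import sqlite3


def make_db(rows):
    """Create a hypothetical in-memory source database with a small 'events' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_email TEXT, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return conn


def aggregate(connections):
    """Pull rows from every source, drop incomplete ones, normalize emails,
    and sum amounts per user."""
    totals = {}
    for conn in connections:
        for email, amount in conn.execute("SELECT user_email, amount FROM events"):
            if email is None or amount is None:
                continue  # skip incomplete rows
            key = email.strip().lower()
            totals[key] = totals.get(key, 0.0) + amount
    return sorted(totals.items())


# Three hypothetical sources with inconsistent formatting and missing values.
sources = [
    make_db([("Ann@example.com", 10.0), ("bob@example.com", 5.0)]),
    make_db([(" ann@example.com ", 2.5), (None, 99.0)]),
    make_db([("bob@example.com", None), ("cy@example.com", 1.0)]),
]
clean = aggregate(sources)
```

In a portfolio project, the same structure applies with real connections in place of `make_db`, and the cleaning rules are worth documenting because they are where most of the judgment lies.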

Unsurprisingly, when employers look at candidates to hire, they look at people who can do one or more of the four things above (the exact ones they look at depend on the company and role). In order to demonstrate that you can aid a business in the 4 areas listed above, you need to demonstrate some combination of these skills:

  • Ability to communicate

  • Ability to collaborate with others

  • Technical competence

  • Ability to reason about data

  • Motivation and ability to take initiative

A well-rounded portfolio should show off your skills in each of the above areas, and be relatively easy for someone to scan – each portfolio item should be well documented and explained, so a hiring manager is able to quickly evaluate your portfolio.

Why a portfolio


If you have a degree in machine learning or a relevant field from a top-tier school, it’s relatively easy to get a data science job. Employers trust that you can add value to their business because of the prestige of the institution that issued you the degree, and the fact that it’s in a subject that’s relevant to their own work. If you don’t have a relevant degree from a top-tier school, you have to build that trust yourself.

Think about it this way: employers can have up to 200 applicants for in-demand jobs. Let’s say that the hiring manager spends 10 hours total filtering the applications down and deciding who to do a phone chat with. This means that each applicant is only evaluated for 3 minutes on average. The hiring manager starts off with no trust that you can add value to the business, and you have 3 minutes to build their trust to the point where they decide to do a phone screen.
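
The 3-minute figure follows directly from the numbers in the paragraph above; as a quick check (the variable names are just for illustration):

```python
applicants = 200   # applicants for an in-demand job
review_hours = 10  # total time the hiring manager spends filtering

# Convert hours to minutes, then divide evenly across applicants.
minutes_per_applicant = review_hours * 60 / applicants
```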



The great thing about data science is that the work you do on your own building projects often looks exactly like the work you’ll do once you’re hired. Analyzing credit data as a Data Scientist at Lending Club probably has a lot of similarities to analyzing the anonymous loan data that they release.



The first few rows of the Lending Club anonymous data.

The number one way to build trust with a hiring manager is to prove you can do the work that they need you to do. With data science, this comes down to building a portfolio of projects. The more “real-world” the projects are, the more the hiring manager will trust that you’ll be an asset to the business, and the greater your chances of getting to a phone screen.


