Do you want to move into the Data professions? Do you want to become a Data Engineer? The time is right to get started, because data powers most of the activities of our digital society: advertising management is increasingly data-driven, as are Artificial Intelligence, green agriculture, hybrid vehicles, smart objects, connected objects, smart computing, and more. In short, the viability of most of today's economic models depends on the intelligent exploitation of data.
Companies increasingly need specialists trained in massively parallel data processing approaches and capable of exploiting data intelligently. Take a look at indeed.fr and type in "Data Engineer", "Data Engineering", or even "Big Data Engineer", and you will understand what we are talking about! The timing is therefore perfect for a career as a Data Engineer. What is more, Data Engineer has been the most in-demand job since the emergence of Big Data, well ahead of Data Scientist.
This article is a job profile in which we explain the complete path to becoming a Data Engineer. We will cover the job description, the value a Data Engineer brings to a company, his missions, his skills, his salary, his career development, and the training to follow to become one.
1 – What’s a Data Engineer?
Also known as a "Big Data Engineer", the Data Engineer is the first actor in the data processing chain. His work takes place upstream of the Data Scientist's, directly after the technical infrastructure has been put in place by the architects and administrators.
The Data Engineer specializes in cross-referencing data and in large-scale data management, using very specific tools and techniques. A person moving towards this profession will be able to use massively parallel computing frameworks such as Hadoop or Spark to manage large volumes of data. The Data Engineer uses his technological expertise to help companies overcome their data quality problems and to validate data compliance with the management rules defined by business departments. Clearly, you are moving towards this profession if you want to help companies with the operational aspects of managing their data.
In practice, his daily work consists of connecting to several data sources, cross-referencing data, performing cleaning operations, filters and joins, managing data storage in different databases, handling various data formats, and potentially producing cross-reports on this data, as in the sketch below.
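To make this daily routine concrete, here is a minimal PySpark sketch of such a task: reading two sources, cleaning them, joining them, and writing the result to a shared store. The file paths, column names, and formats are illustrative assumptions, not a prescribed pipeline.

```python
# Hypothetical daily task: cross-reference a CRM extract with e-commerce orders.
# Paths, column names, and formats are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-consolidation").getOrCreate()

# Source 1: customer records exported by the CRM team (CSV)
customers = (
    spark.read.option("header", True).csv("/data/crm/customers.csv")
    .dropDuplicates(["customer_id"])                       # remove duplicate rows
    .withColumn("email", F.lower(F.trim(F.col("email"))))  # basic cleaning
)

# Source 2: orders produced by the e-commerce platform (JSON)
orders = (
    spark.read.json("/data/shop/orders.json")
    .filter(F.col("amount") > 0)                           # drop invalid amounts
)

# Cross-reference the two sources and store the result for downstream use
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("/data/warehouse/orders_enriched")
```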
Demand for this profession has been rising steadily since 2016, driven by the ever-growing transition of companies from traditional Business Intelligence systems to Big Data systems and by the implementation of Data Labs. We will come back to this below.
2 – Why Does A Company Need A Data Engineer?
What motivates the recruitment of a data engineer in a company? In fact, the Data Engineer is essential to a company mainly for 3 reasons:
A) Data collection is by definition a siloed process in a company.
This means that each business department collects and manages its data independently of the other departments. As a result, data is dispersed across the company's different business units: these are called "data silos". With siloed data, it is impossible to have a global view of the company's activity. Data silos create duplicate and incomplete versions of data, which in turn create problems of incompleteness (missing values) that are detrimental to machine learning work. Companies suffer greatly from the problems caused by data silos, and the explosion of data in the Big Data era has given these issues unprecedented prominence. Thus, to make informed and effective decisions, it is essential to first "de-silo" the company's data, that is, to standardize and consolidate it either in a dedicated place (the Data Lab) or in a single repository (the data warehouse or data lake). The Data Engineer intervenes to guarantee the standardization (or de-siloing) of this data, and to develop the applications that will use it.
B) Reporting
Standardizing data actually has only one goal: to support decision-making. To exploit data once it has been cleaned, it must be queried to obtain the indicators needed for decision-making. Depending on the volume, the characteristics of the data, and the IT tools used by the company, this can be very complex, to the point where a whole discipline is needed to deal with the subject. Business Intelligence refers to the methods, techniques, and software tools used to query data effectively and support decision-making. Thanks to his mastery of Business Intelligence, the Data Engineer is the most appropriate profile to produce the reports and indicators essential to monitoring the company's activity, as in the sketch below.
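As an illustration of this reporting role, here is a minimal sketch of a decision-support indicator expressed in SQL and run through Spark. The table and column names (warehouse.orders_enriched, business_unit, amount) are assumptions carried over from the earlier sketch, not a real schema.

```python
# Hedged sketch: producing a monthly revenue KPI per business unit.
# Table and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monthly-kpi").enableHiveSupport().getOrCreate()

monthly_revenue = spark.sql("""
    SELECT business_unit,
           date_format(order_date, 'yyyy-MM') AS month,
           SUM(amount)                        AS revenue,
           COUNT(DISTINCT customer_id)        AS active_customers
    FROM   warehouse.orders_enriched
    GROUP  BY business_unit, date_format(order_date, 'yyyy-MM')
    ORDER  BY month, business_unit
""")

# The resulting indicator can feed a dashboard or a scheduled report
monthly_revenue.show()
```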
C) Complex use cases requiring new paradigms
The third situation leading to the need for a Data Engineer in a company is Big Data. Before Big Data, the reasoning used to derive value from data was relatively simple: centralize the storage and processing of data on the server of a client/server architecture. The central server here is a very powerful machine, custom designed by companies specializing in IT infrastructure such as EMC, Lenovo, Dell or HP. We detail this philosophy in depth in the "Introduction to Hadoop" article. Unfortunately, in the era of Big Data, this reasoning no longer holds! Indeed, the scale of data growth today exceeds the reasonable capacity of traditional technologies, and even of the typical hardware configurations giving access to this data. The appropriate reasoning now is to distribute data storage and parallelize processing across the nodes of a cluster, as in the sketch below. Similarly, in other specific use cases, such as the exploitation of data produced over time or in streaming, new paradigms or new ways of understanding the use case are necessary. Thanks to his varied skills and his in-depth knowledge of conceptual approaches to data processing, the Data Engineer is essential for addressing the different use cases or data problems that the company encounters.
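The sketch below illustrates, under assumptions, this "distribute storage, parallelize processing" idea: the data lives on a distributed file system (HDFS here) and the computation is spread over the executors of a cluster. The paths and column names are hypothetical, and the cluster manager (YARN, Kubernetes, standalone) would normally be chosen when launching the job with spark-submit.

```python
# Hedged sketch of distributed storage + parallel processing on a cluster.
# Paths and columns are illustrative; run with spark-submit against a cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("clickstream-aggregation")
    .getOrCreate()  # the cluster manager is selected at spark-submit time
)

# Each executor reads and processes the HDFS blocks closest to it
clicks = spark.read.parquet("hdfs:///data/raw/clickstream/")

# The aggregation runs in parallel on every partition, then partial results are merged
daily_traffic = (
    clicks.groupBy(F.to_date("event_time").alias("day"))
          .agg(F.count("*").alias("page_views"))
)

daily_traffic.write.mode("overwrite").parquet("hdfs:///data/agg/daily_traffic/")
```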
3 – The 4 Tasks of a Data Engineer
As you have seen previously, the role of the Data Engineer can be summed up in 2 points: the standardization of data and its consolidation for decision-making. From this role derive his 4 main missions:
- Design the technical architecture necessary for deriving value from data. The architecture can be global, for example for data lake construction projects, or local, for specific use cases coming from specific business departments. In this case, the Data Engineer must define and validate the architectural choices of the Big Data solutions to adopt.
- Provide the technological expertise needed to develop data solutions appropriate to the different use cases emanating from the company's business units (Kerberization of the Hadoop cluster, securing the infrastructure, choice of Big Data technologies adapted to the business use cases, development of Big Data solutions with business units, modeling and implementation of databases, construction of the Data Warehouse, etc.).
- Carry out the necessary data cross-referencing as well as the validation, correction and quality work required to support the work of downstream Data Scientists. As a reminder, Data Scientists need quality data to carry out their work, because machine learning algorithms are very sensitive to missing values, outliers, and inconsistencies in the internal structure of the data. The Data Engineer must do everything necessary upstream to provide "clean" data to Data Scientists (see the sketch after this list).
- In some cases, the Data Engineer may be required to perform decision-support analyses on the data he processes (although most of the time this role is assigned to the Data Analyst). In this case, he will carry out the data cross-referencing and consolidation work leading to reports that support decision-making, and he will develop dashboards and key performance indicators (KPIs) using different technologies (depending on the company's IT assets).
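As promised in the third mission above, here is a hedged sketch of the upstream cleaning a Data Engineer performs before handing data over to Data Scientists: unusable rows are dropped, a safe default is imputed, and a crude outlier cut-off is applied. The thresholds, paths, and column names are illustrative assumptions.

```python
# Hedged sketch: preparing an ML-ready dataset (missing values, outliers).
# Thresholds, paths, and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ml-ready-dataset").getOrCreate()

raw = spark.read.parquet("/data/warehouse/orders_enriched")

clean = (
    raw.dropna(subset=["customer_id", "amount"])     # rows unusable for ML
       .fillna({"country": "UNKNOWN"})               # impute a safe default
       .filter(F.col("amount").between(0, 10_000))   # crude outlier cut-off
)

clean.write.mode("overwrite").parquet("/data/curated/orders_ml_ready")
```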
4 – The Skills of a Data Engineer
Standardizing data, consolidating it for decision-making analyses, and developing application solutions require very diverse skills, both from a conceptual point of view (the way of approaching the problem) and from a technical point of view (mastering the techniques and technologies that actually allow the solution to be implemented). Thus, given the diversity of data issues in general, and Big Data in particular, the Data Engineer must have a wide range of skills, to the point where it can even be worthwhile to specialize in a single type of problem: for example, there are Data Engineers specialized only in database problems, others only in content search problems, others in the large-scale exploitation of data, and still others in streaming and real-time data.
In the same way, when you decide to pursue a career as a Data Engineer, you can (and it is recommended that you do) specialize in one issue while keeping basic knowledge of the others. It will then be easier for you to seriously build up skills on other Big Data issues later on.
However, although we recommend specialization, all Data Engineers have a common set of skills. These skills fall into two categories, namely:
- Conceptual skills, focused on data management issues.
These are the skills that allow the Data Engineer to approach each type of problem conceptually. For example, to address streaming data processing issues, it is necessary to master concepts such as message delivery semantics, exactly-once semantics, atomic message delivery, data buses, and publish-subscribe messaging systems (a streaming pipeline is sketched after this list), while to address database issues, you need to know the different categories of DBMS (SQL, NoSQL, NewSQL, column-oriented, key/value, etc.), decisional modeling, storage in a distributed environment, OLAP cubes, etc. Each data engineering problem has specific requirements, and the Data Engineer must have a global understanding of how to address them;
- Technological skills, focused on mastering specific tools and languages, because each problem has its own tools and languages.
For example, large-scale data querying issues will technically require mastery of frameworks such as Hadoop, Spark, Kafka, HBase, Cassandra, Hive, Pig and Oozie, and mastery of SQL, Scala and Python. It will also be necessary to master the tools needed for deployment and application life-cycle management, such as Maven, Nexus, Git, Jenkins, etc. But for the same issues at a reasonable scale of data, mastering SQL and platforms like Teradata is sufficient; at a pinch, Python alone would even do the job very well. You can therefore see that the skills to be developed vary according to the needs of the company, the scope of the projects, and the issues at hand. His full skills matrix follows the sketch below.
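As announced in the conceptual-skills item above, here is a minimal sketch of a streaming pipeline combining the publish-subscribe model (Kafka) with Spark Structured Streaming. The broker address, topic name, and message schema are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Hedged sketch: consuming a Kafka topic and aggregating events as they arrive.
# Broker, topic, and schema are assumptions; requires the spark-sql-kafka connector.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-orders").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

# Subscribe to the topic (publish-subscribe model) and parse the JSON payload
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("order"))
    .select("order.*")
)

# Running total per customer, updated continuously as messages are delivered
totals = events.groupBy("customer_id").agg(F.sum("amount").alias("total_spent"))

# The checkpoint is what underpins the pipeline's delivery guarantees on restart
query = (
    totals.writeStream.outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/streaming-orders")
    .start()
)
query.awaitTermination()
```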
Full Data Engineer Skills Matrix:
- mastery of programming languages: Scala, Java, Python, Shell, VBA
- knowledge of operating systems: UNIX, Linux, Solaris, Windows
- knowledge of SQL database solutions: Teradata, Microsoft SQL Server, SAS Base, SAP Hana
- knowledge of NoSQL systems: Elasticsearch, HBase, Cassandra, Redshift
- knowledge of ETL processes and tools: Talend Open Studio, Pig Latin, Sqoop
- strong expertise in SQL and derivatives: SQL, HiveQL
- mastery of massively parallel data computing frameworks: Hadoop, Spark, Kafka
- knowledge of techniques for improving the performance of queries and Business Intelligence systems (OLAP)
- know how to consolidate data, produce KPIs and build dashboards using tools such as Excel, Power BI, Tableau Software, or QlikView
- be comfortable in cloud environments: GCP, Azure HDInsight, AWS
- be comfortable with continuous integration and deployment tools: Jenkins, Git, GitHub, GitLab, CI/CD pipeline creation, Docker, Ansible, Kubernetes, etc.
- have basic knowledge of Machine Learning, Data Science, and Artificial Intelligence in order to collaborate effectively with Data Scientists.
If we stay within the strict framework of Big Data, the Data Engineer must know how to use Hadoop (MapReduce) or Spark to address large-scale data ingestion issues. He must master the categories of SQL-on-Hadoop tools (Impala, Phoenix, HAWQ), abstraction languages (HiveQL, Pig Latin) and NoSQL databases (HBase, HCatalog, MongoDB). He must be able to write SQL, HiveQL and Pig Latin queries, connect companies' traditional Business Intelligence systems to Hadoop, write the complex queries needed to meet business needs for reporting and indicator calculation, and query and exploit databases to integrate data in various formats, as in the sketch below.
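To illustrate this "integration of data in various formats", here is a hedged sketch that lands CSV and JSON extracts in a single Hive-managed Parquet table, which HiveQL queries and BI tools can then exploit. The paths, database, and table names are assumptions, and the target database is assumed to already exist.

```python
# Hedged sketch: consolidating heterogeneous extracts into one Hive table.
# Paths, database, and table names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("format-integration")
    .enableHiveSupport()
    .getOrCreate()
)

csv_extract = spark.read.option("header", True).csv("/landing/finance/invoices.csv")
json_extract = spark.read.json("/landing/web/invoices.json")

# Align the two extracts on a common set of columns before consolidating them
common_cols = ["invoice_id", "customer_id", "amount", "invoice_date"]
consolidated = csv_extract.select(common_cols).unionByName(
    json_extract.select(common_cols)
)

# Store as a partitioned Hive table queryable from HiveQL or BI tools
(
    consolidated.write.mode("overwrite")
    .partitionBy("invoice_date")
    .saveAsTable("warehouse.invoices")
)

# Downstream, reporting queries can then be expressed directly in SQL/HiveQL
spark.sql("SELECT COUNT(*) FROM warehouse.invoices").show()
```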
5 – Data Engineer Training: How To Become a Data Engineer?
The most obvious way to develop your skills as a Data Engineer is to complete a specialized Master's program. Please note: when we talk about a specialized Master's, we are not referring only to a diploma from a French institution accredited by the Conférence des Grandes Écoles, nor to a post-master's degree (a diploma obtained after a master's). We mean any Master's, public or private, accredited by a conference of major schools or not, that exclusively teaches Big Data. The purpose of such a Master's is to provide you with the infrastructure needed to learn Big Data technologies. Some programs add business lessons covering the non-technical aspects of data, such as legal aspects, MDM, data management, GDPR, etc.
If a specialized Master's is beyond your means, or if for one reason or another you cannot take one, all is not lost! You can also develop Data Engineering skills by passing several certifications, especially if you already have a good level as an IT consultant or if you are self-taught. A certification validates the skills acquired on a subject and can be a good way to position yourself as an expert in the market (I hold 6 myself). If you prefer this path, we recommend the following 3 certifications:
- Cloudera Certified Professional Data Engineer: offered by Cloudera, this certification covers data ingestion, transformation, storage and analysis on Cloudera's Hadoop distribution using Spark SQL, the Spark shell, Hive, Spark Streaming, Kafka, Flume, Python and many other tools of the distribution. To prepare for it, Cloudera recommends taking its "Cloudera's Spark and Hadoop Developer" training;
- MapR Certified Hadoop Developer: offered by MapR (since acquired by HPE), this certification validates skills in developing MapReduce programs in Java. The exam tests the candidate's ability to write MapReduce programs, use the MapReduce API effectively, and manage and track the execution of MapReduce workflows. The Hadoop distribution used is, of course, the MapR distribution. To prepare for the exam, MapR invites candidates to take the DEV 301 – Developing Hadoop Applications course;
- EMC DELL Certified Data Scientist Associate: before being acquired by Dell, EMC developed a certification program broader than those of the Hadoop vendors. It covers the whole of data exploitation: statistical learning techniques with MLlib and R, data visualization and presentation techniques, the use of Greenplum, the writing of data processing queries in MapReduce, HiveQL and Pig, data storage in HBase, functional knowledge of the main tools of the Hadoop ecosystem, and business skills on recommendation, classification and sentiment analysis problems. To pass this certification, EMC recommends following its "Data Science and Big Data Analytics" training course. Having earned this certification ourselves, we also strongly recommend it.
6 – Data Engineer Salary
According to Glassdoor, the average total annual pay of a Data Engineer is $111,998, including base salary and additional pay such as bonuses and profit sharing. The average total pay of a Senior Data Engineer, meanwhile, is $154,989 a year. Generally, you can expect to earn a higher-than-average salary as a Data Engineer.
Note: total pay is the combined amount of Glassdoor users' reported average salary and additional pay, which may include profit sharing, commissions, cash bonuses, or tips. In the UK market, indicative figures by role are:
- Scala Data Engineer – £90,000
- Engineering Manager – £90,000
- Senior Data Engineer – £100,000
- Contract Data Engineer – £500-650 per day
- Big Data Engineer – £90,000
- Data Engineer – £100,000
7 – Data Engineer Career Development
Thanks to the diversity of his profile, the Data Engineer can evolve into several positions over his career, but 2 are the most likely: Big Data Architect and Tech Lead:
- Career evolution towards a Big Data Architect position: thanks to his conceptual skills and his in-depth knowledge of how the different technologies and data processing paradigms work, the Data Engineer can easily evolve towards a career as a Big Data Architect, where those skills will be used to advise companies on the architectures appropriate to the data use cases they face. The transition to an Architect position is typically possible after 5 years of Data Engineering.
- Career evolution towards a Tech Lead position: the position of Technical Leader (Tech Lead for short) is the natural progression for the Data Engineer. Generally, after 3 years, the Data Engineer has sufficiently refined his technical and software engineering skills to become the technical reference for an entire Big Data project.
8 – Difference Between A Data Engineer and A Data Scientist
Some still confuse the work of a Data Scientist with that of a Data Engineer. The confusion comes from the fact that the 2 profiles both work on the operational aspects of data. But make no mistake: these are two different professions with different roles and different functions! The Data Engineer performs the upstream data quality and validation work required for Data Science. As we said above, the machine learning algorithms the Data Scientist uses to build his statistical learning models are very sensitive to the consistency and quality of the data. Thus, where the Data Engineer is responsible for providing valid data, the Data Scientist is responsible for making sense of it.
ABOUT LONDON DATA CONSULTING (LDC)
We, at London Data Consulting (LDC), provide all sorts of Data Solutions. These include Data Science (AI/ML/NLP), Data Engineering, Data Architecture, Data Analysis, CRM & Lead Generation, Business Intelligence and Cloud solutions (AWS/GCP/Azure).
For more information about our range of services, please visit: https://london-data-consulting.com/services
Interested in working for London Data Consulting? Please visit our careers page at https://london-data-consulting.com/careers