BI engineer, data warehouse engineer, ETL engineer, data development engineer (big data development engineer): what's the difference?
Explaining the concept of a data warehouse in the abstract is not very interesting, so let's start with the different roles.
Boss: I'm the boss of a mobile phone company. I report to the board today, and I want to present user growth over the past three years, user retention, user activity, and the usage rate of every app on our phones. If I had no BI people under me, I would be completely lost.
BI: I'm a non-technical BI analyst. Every day I read analysis reports on competing products, look at Double 11 sales figures, and read the reviews, so I know the strengths and weaknesses of our products. I analyze regional differences between north and south and customer preferences at home and abroad. In short, I am good at interpreting both the mobile phone industry and its data, and I can produce very good-looking charts and PPT slides.
[Figure: a visualization made with FineBI]
Today my boss asked me for a report, so I have to call in a favor with the ETL engineer to get the data for it. Based on that data I need to offer some interpretation: why phones are selling worse this month than last month, and why user churn keeps getting worse. That is my job.
ETL engineer:
I'm the ETL engineer at the bottom of the food chain. I can write shell scripts, I know Hadoop/Hive/HBase, and I can write SQL with very complex logic. Today that BI analyst, who can't compute the data herself, asked me to run some more numbers. I wanted her to go through the requirements process, but she said the boss wanted it (the trump card that operations people always play!!!) and that it was urgent.
I had no choice but to put down my own work and run the data for her. Half an hour later I handed it over, hoping that would be the end of it.
If you think that is what I do every day, you're wrong. Besides the tasks assigned to me, I am also responsible for the ETL process, data modeling, and configuring scheduled jobs, and sometimes I even have to do Hadoop cluster maintenance. Any one of these could fill a book on its own.
Take the ETL process. You have to pull raw data from various databases, normalize the different business logs from various servers into one format, agree on a delimiter, and then import everything into the distributed file system HDFS. You may even need to define data formats and specifications together with the business systems.
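The normalization step described above can be sketched in Python. This is only a minimal illustration: the two log formats, the field names (`uid`, `action`, `time`), and the tab delimiter are assumptions made up for the example, not the format of any real system.

```python
import json

DELIMITER = "\t"  # the agreed separator (an assumption for this sketch)
FIELDS = ["user_id", "event", "ts"]  # hypothetical unified schema


def from_json_log(line):
    """Parse a hypothetical app-server log line stored as JSON."""
    rec = json.loads(line)
    return {"user_id": str(rec["uid"]), "event": rec["action"], "ts": rec["time"]}


def from_pipe_log(line):
    """Parse a hypothetical web-server log line of the form ts|user|event."""
    ts, user, event = line.strip().split("|")
    return {"user_id": user, "event": event, "ts": ts}


def normalize(record):
    """Render one record in the unified, delimiter-separated format."""
    # Replace any stray delimiter inside a value so columns stay aligned.
    values = [str(record[f]).replace(DELIMITER, " ") for f in FIELDS]
    return DELIMITER.join(values)


def normalize_all(json_lines, pipe_lines):
    """Convert both log sources into rows that share one format."""
    rows = [normalize(from_json_log(line)) for line in json_lines]
    rows += [normalize(from_pipe_log(line)) for line in pipe_lines]
    return rows
```

The resulting rows could be written to a local file and then copied into HDFS, for example with `hdfs dfs -put`.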
Once data collection is done, you build the intermediate tables: filter the data, unify the formats, unify the IDs, unify the dimensions, and reconcile the same data as it appears in different sources. After that you have daily and weekly data, which you then organize into the format the requirements call for and load into MySQL, HBase, and so on.
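A rough sketch of that intermediate-table step, using Python's built-in `sqlite3` as a stand-in for Hive or MySQL. The table name `mid_events`, the ID mapping, and the daily-active-users metric are all hypothetical choices for illustration.

```python
import sqlite3


def build_daily_active_users(events, id_map):
    """Build an intermediate table and roll it up into a daily metric.

    events: iterable of (source_id, event, day) tuples from different sources.
    id_map: maps each source-specific ID to one unified user ID.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mid_events (user_id TEXT, event TEXT, day TEXT)")

    # Filtering and ID unification: drop unknown IDs and empty events,
    # and rewrite every source ID to the unified one.
    cleaned = [
        (id_map[src_id], event, day)
        for src_id, event, day in events
        if src_id in id_map and event
    ]
    conn.executemany("INSERT INTO mid_events VALUES (?, ?, ?)", cleaned)

    # Daily rollup: distinct active users per day.
    rows = conn.execute(
        "SELECT day, COUNT(DISTINCT user_id) FROM mid_events "
        "GROUP BY day ORDER BY day"
    ).fetchall()
    conn.close()
    return rows
```

In a real pipeline the rollup rows would be what gets exported to the serving database; here they are simply returned.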
All in all, you collect all kinds of data, process it in all kinds of ways, and then import and export it. Isn't that interesting?
But a data warehouse like this is still very rudimentary, and there is a lot of room for ETL engineers to do more:
1. Normally the chain for making a report is boss -> BI -> ETL. Could BI compute the data directly instead? And if the SQL is too complicated for that, could we label all the data so that BI, or even the boss, can simply pick whatever they want?
2. ETL engineers can automate data collection, standardize business log formats, and make everything configurable, but all of it runs on an N+1 basis: what happens today cannot be seen until tomorrow. Is there a system that can analyze data in real time or near real time? Think of the Double 11 big screen. If Boss Ma had to wait until the 12th to learn how many transactions had been completed, it would be a wonder if the data team were not torn to pieces.
3. At present most analysis systems are built on offline computing (Hadoop/ODPS), which creates a problem: when operations or BI want to see data, they have to wait for a slow offline job to finish. So is there a system that can handle your data volume and fairly complex logic and still return results in milliseconds?
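The contrast raised in points 2 and 3, a next-day batch scan versus a result that is kept up to date as events arrive, can be sketched as follows. This is a toy illustration under simple assumptions (a single process, events held in memory); the class and function names are invented for the example and do not describe how any real-time system is actually built.

```python
from collections import defaultdict


def batch_totals(events, day):
    """N+1 style: scan all of a day's events after the day is over."""
    amounts = [amount for d, amount in events if d == day]
    return len(amounts), sum(amounts)


class RunningTotals:
    """Near-real-time style: update the answer on every incoming event."""

    def __init__(self):
        self.orders = defaultdict(int)
        self.amount = defaultdict(float)

    def on_order(self, day, amount):
        # Constant-time update, so the totals are always current.
        self.orders[day] += 1
        self.amount[day] += amount

    def snapshot(self, day):
        # Reading is a dictionary lookup, not a scan over raw events.
        return self.orders[day], self.amount[day]
```

Because the aggregate is maintained incrementally, a query is just a lookup, which is the basic idea behind keeping a big-screen counter current during the day instead of waiting for a nightly job.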
Algorithm engineers, big data operations and maintenance engineers, and other roles also come into the picture.
The concept of a data warehouse is very broad, but it looks small next to big data applications.
Suppose we stratify the value of data. There are many ways to do the layering, and I'll list just one, which has 5 layers:
Layer 1: decision support for the boss, for example traditional financial statements.
Layer 2: decision support for operations, for example Taobao operators who know their data inside out.
Layer 3: support for products, for example product managers who read the reports every day to check whether a certain button is placed in the right spot.
Layer 4: data used in production, for example feeding the advertising system directly to generate revenue, feeding the recommendation system directly to recommend products to users and give every user a personalized experience ("a thousand faces for a thousand people"), or pushing different news to different users directly in a mobile app.
Layer 5: big data exchange, earning revenue directly from the data itself.
A company that does the first two layers well is already doing very well, and one that reaches the third is impressive. As for the fourth and fifth layers, no more than 3 domestic Internet companies get there; Alibaba and Tencent can. Big data applications are too big a topic, and I don't know where to start, so let's talk about them another time.