Background

More than 100 million messages flow through Xianyu (literally "Idle Fish") every day, reaching half of its users. Because the goods are second-hand, Xianyu users rely on chat to learn more about an item's condition and to negotiate its price, so messaging, a basic function of Xianyu, plays an important role in closing transactions. Buyers and sellers on Xianyu are also usually individual users, and whether they are online at any given moment is highly uncertain. If message delivery fails, the transaction may be affected, and there may even be cases of fraud on WeChat. It is therefore urgent to provide users with a stable and reliable message service through effective means.

Problem definition

In messaging, the main problems from the user's point of view are message loss and delayed message delivery. Technically, the root cause of message loss lies in the client-side architecture: messages are both pulled through a server interface and pushed over the ACCS long connection channel. When a large number of messages arrive at the client at the same time, an exception during message merging or storage can cause messages to be discarded, so the user loses them. Delays in online messages are mostly caused by latency and congestion in the ACCS channel.

However, Xianyu messages travel a long send-and-receive link, the processing logic on both the server and the client is complex, and third-party access is also involved. This raises three main problems:

* How can problems be found as early as possible before release?

* How can online problems be found quickly?

* How can problems reported through user feedback be located effectively?

All of these problems are worth solving. However, since its launch the Xianyu message function has implemented its core logic in-house rather than using the group's ready-made message SDK. After the explosive growth of the early business and several changes of developers, taking over and developing messaging requirements now feels like walking on thin ice, and this is what leads to the online message problems. In governing message problems, the top priority is therefore a comprehensive troubleshooting capability that can quickly locate online problems.

Building full-link message troubleshooting

When locating a problem reported through user feedback, it is hard to get to the root of it from a brief text description and a few screenshots; often we cannot even confirm whether the problem is on the client or on the server. For example, one evening as we were about to leave work, the boss forwarded a piece of user feedback: a user had reported lost messages. Everyone panicked. Server, client, and QA colleagues all gathered to locate the problem, and the only method available was for each person to check their own logs and then guess blindly, which was time-consuming, labor-intensive, and inaccurate.

Seen this way, improving the quality of the Xianyu message service takes more than development and optimization; infrastructure is also needed to make problems easier to locate. The quality team and the development team therefore focused on building full-link troubleshooting for Xianyu messages. The key to full-link troubleshooting is comprehensive message log support that yields the complete track of a message. We aggregate the server-side message node logs, interface logs, client message-status tracking logs, and behavior tracking logs to restore the user's behavior and the message path as fully as possible.

Log reporting

First come the core scenarios of the message link. The key nodes where problems tend to occur or surface, such as message merging, persistence to the database, on-screen rendering, domain ring synchronization, and domain ring update, report the logs needed for troubleshooting.

Second is the log format convention. At its core, the client generates a messageId for each message it creates; for marketing messages pushed by the server, the messageId is generated by the server. Each time a message passes through a core node, only the node's status is reported together with the messageId. The id is also passed to the server and included in the logs it reports, so the messageId strings together a complete tracking link from the client to the server and back to the client.

Last is the reporting method. The initial idea was to integrate the SLS SDK on the client for real-time reporting, but the cost of changing the client was too high and would have affected stability, so we gave it up. In the end, the client's existing tracking-event reporting path is reused; the tracking logs only need to be cleaned in real time afterward, which meets the requirement of real-time log collection without modifying the client. Server-side log reporting uses the existing SLS reporting link.

It should be emphasized that, for privacy and storage cost reasons, log reports do not carry the message content; they carry only the necessary parameters and the messageId.
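The log format convention above can be sketched roughly as follows. This is only an illustration under assumptions: the `NodeEvent` fields, the node names, the use of a UUID as the messageId, and the in-memory `sink` are all hypothetical and are not taken from the actual client or server code.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class NodeEvent:
    """One record reported when a message passes a core node (hypothetical schema)."""
    message_id: str
    node: str    # assumed node names, e.g. "send", "receive", "render"
    status: str  # assumed status values, e.g. "ok" or "error"
    side: str    # "client" or "server"
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


def new_message_id() -> str:
    """Create an id when the message is generated (using a UUID is an assumption)."""
    return uuid.uuid4().hex


def report(event: NodeEvent, sink: list) -> None:
    """Append the record to a sink; a real client would reuse its tracking pipeline."""
    sink.append(asdict(event))


sink = []
mid = new_message_id()
body = "Is this still available?"  # the message content is never put into a record
for side, node in [("client", "send"), ("server", "receive"), ("client", "render")]:
    report(NodeEvent(message_id=mid, node=node, status="ok", side=side), sink)
```

Every record carries the same messageId and a node status, so records from both sides can later be joined into one track, while the message body stays out of the logs.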

Real-time log cleaning

Real-time log cleaning first subscribes to the minute-level tracking logs from TT and extracts the message-related ones. The volume of data is huge, however: a single message may produce dozens of tracking events. We therefore cluster these logs by messageId and utdid, which reduces the data volume by a factor of dozens. Finally, the data is written back to SLS for link troubleshooting, and minute-level statistics are written to TDDL for building monitoring.
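A minimal sketch of the clustering step, assuming each cleaned tracking record is a dictionary with hypothetical `messageId`, `utdid`, `node`, `status`, and `ts` (seconds) fields. The real job runs on a streaming pipeline and writes to SLS and TDDL; here the results simply stay in memory.

```python
from collections import Counter, defaultdict


def cluster_events(events):
    """Collapse raw tracking records into one time-ordered track per (messageId, utdid)."""
    clusters = defaultdict(list)
    for e in events:
        clusters[(e["messageId"], e["utdid"])].append(e)
    return {
        key: [(e["node"], e["status"]) for e in sorted(group, key=lambda e: e["ts"])]
        for key, group in clusters.items()
    }


def minute_stats(events):
    """Count records per (minute, node, status), the kind of figure used for monitoring."""
    return Counter((e["ts"] // 60, e["node"], e["status"]) for e in events)


raw = [
    {"messageId": "m1", "utdid": "d1", "node": "persist", "status": "ok", "ts": 61},
    {"messageId": "m1", "utdid": "d1", "node": "merge", "status": "ok", "ts": 60},
    {"messageId": "m1", "utdid": "d1", "node": "render", "status": "error", "ts": 62},
    {"messageId": "m2", "utdid": "d2", "node": "merge", "status": "ok", "ts": 125},
]
tracks = cluster_events(raw)
stats = minute_stats(raw)
```

Four raw records become two tracks here; with dozens of tracking events per message, the same grouping is what shrinks the data by an order of magnitude.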

User behavior log

Besides following the life cycle of a single message, the other side of troubleshooting is to look, at a macro level, at which user behavior on the client triggered the exception.

By cleaning and organizing in real time the click and page-exposure tracking events reported by the client, we can find out which buttons the user clicked and which pages they visited before the exception occurred, and from that work out a path to reproduce it. We also integrate the server interface call logs to check, at the time of the exception, whether the request to the server interface succeeded, whether the request parameters were correct, and what the error code was. Integrating this information helps us reproduce and locate the specific scenario and helps developers solve problems faster.
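One way to picture this integration is a merged timeline of client tracking events and server interface calls in the window before an exception. The sketch below is an assumption-laden illustration: the field names (`ts`, `type`, `target`, `api`, `code`), the 60-second window, and the sample values are hypothetical.

```python
def behavior_before(exception_ts, tracking_events, api_logs, window_s=60):
    """Merge client tracking events and server API logs from the window
    before an exception into a single timeline sorted by timestamp."""
    start = exception_ts - window_s
    timeline = []
    for e in tracking_events:
        if start <= e["ts"] <= exception_ts:
            timeline.append((e["ts"], "client", f'{e["type"]}:{e["target"]}'))
    for a in api_logs:
        if start <= a["ts"] <= exception_ts:
            timeline.append((a["ts"], "server", f'{a["api"]} -> {a["code"]}'))
    return sorted(timeline)


tracking = [
    {"ts": 100, "type": "exposure", "target": "chat_page"},
    {"ts": 105, "type": "click", "target": "send_button"},
    {"ts": 10, "type": "click", "target": "home_tab"},  # outside the window
]
api = [{"ts": 106, "api": "message.send", "code": "TIMEOUT"}]
timeline = behavior_before(110, tracking, api)
```

Read top to bottom, the timeline gives a candidate reproduction path: the pages visited, the buttons clicked, and the interface call that failed.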

Front-end interaction

"Before the troops move, the data goes first." The data preparation described above lets us investigate possible problems in the message link in finer detail. The next goal is to make problems more visible and troubleshooting more convenient for developers. From watching how developers investigate message problems, we found that they generally query along three dimensions: user, message ID, and session ID, so we organized the queries around these three.

In addition, so that users of the tool can see more intuitively how a message fared at each node of the link during a query, we divided the link into three stages: client uplink, server, and client downlink. Abnormal nodes are prominently highlighted so that users can quickly find where the problem is and in which stage.
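The three-stage view can be sketched as a simple grouping over a message's track. The node names and the node-to-stage mapping below are hypothetical; only the three stage names come from the text.

```python
STAGES = ("client uplink", "server", "client downlink")

# Hypothetical mapping from node name to link stage.
NODE_STAGE = {
    "send": "client uplink",
    "receive": "server",
    "push": "server",
    "merge": "client downlink",
    "persist": "client downlink",
    "render": "client downlink",
}


def link_view(track):
    """Group a message's (node, status) pairs by stage and report the first abnormal node."""
    view = {stage: [] for stage in STAGES}
    first_abnormal = None
    for node, status in track:
        stage = NODE_STAGE[node]
        view[stage].append((node, status))
        if status != "ok" and first_abnormal is None:
            first_abnormal = (stage, node)
    return view, first_abnormal


track = [
    ("send", "ok"),
    ("receive", "ok"),
    ("push", "ok"),
    ("merge", "ok"),
    ("persist", "error"),
]
view, abnormal = link_view(track)
```

For this sample track, `abnormal` points at the `persist` node in the client downlink stage, which is the node a front end would highlight.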

Summary and outlook

Now, with the link troubleshooting tool of the Xianyu message quality platform, we can clearly see the complete life cycle of a message and check the problem path against the log of abnormal user behavior, which helps developers quickly locate the stage that may be at fault. There is no longer any need to search multiple databases and log streams by hand and integrate the results, or even to wait until the next day to get the data, and troubleshooting efficiency has improved by more than 90%. The approach is also reusable: for example, troubleshooting for the publishing link is being onboarded to the platform. Beyond that, the message quality platform has done a lot for the efficiency of problem discovery and testing, such as on-device inspection and reconciliation, real-time monitoring of core metrics, user feedback management, and link-level regression testing tools, which will be introduced in detail later. We are also working on combining automation with on-device intelligence, and we hope that with this support Xianyu messaging becomes more and more stable.

Can't sit idle? Come to Xianyu!

PICK ME

The Xianyu technology team pursues greater value through innovation and drives business change.

From the idle-goods business to the old-goods business, to building "Worry-free shopping", "Play community", and "New offline",

from publishing books and speaking at summits to open source, patents, and overseas exchange,

we can't sit idle, so come to Xianyu. The technical team's exploration and deep pursuit of excellence is our foundation.

Join now

1. We are hiring client, server, front-end, architecture, and quality engineers.

2. Send your resume to guicai.gxy@alibaba-inc.com

3. You can also find us on Toutiao, Zhihu, Juejin, Facebook, and Twitter.
