Background

More than 100 million messages flow through Idle Fish (Xianyu) every day, reaching half of its users. Because the goods are second-hand, users rely on chat to learn more about an item's condition and to negotiate the price, so messaging, a basic function of Idle Fish, plays an important role in closing transactions. At the same time, buyers and sellers on Idle Fish are usually individuals, and whether they are online at any given moment is highly uncertain. Once message delivery goes wrong, transactions may suffer, and there can even be cases of fraud on WeChat. Providing users with a stable and reliable message service through effective means is therefore urgent.

Problem definition

In messaging, the main problems from the user's point of view are lost messages and delayed delivery. Technically, the root cause of message loss lies in the client-side architecture: messages are both pulled through server interfaces and pushed over the accs long-connection channel, and when a large number of messages arrive at the client at the same time, an exception during message merging or persistence causes messages to be discarded, so the user loses them. Delays for online messages are mostly caused by latency and congestion in the accs channel.

However, Idle Fish messages travel a long send-and-receive link, the processing logic on both server and client is complex, and third-party integrations are also involved. This leaves three main questions:

* How can problems be found as early as possible before release?
* How can online problems be discovered quickly?
* How can user-reported issues be located efficiently?

All of these are worth solving. But since its launch, the Idle Fish message function has implemented its core logic in-house rather than using the group's ready-made messaging SDK. Combined with explosive early business growth and repeated handovers between developers, this has left developers who now take over messaging requirements feeling as if they are walking on thin ice, and it is the source of the frequent online message problems. For message problem governance, the top priority is therefore a comprehensive means of troubleshooting that can quickly locate online problems.

Building full-link message troubleshooting

When locating a user-reported issue, it is hard to get to the root of the problem from a limited text description and a few screenshots; we often cannot even tell whether the problem is on the client or on the server. For example, one evening as we were about to leave work, the boss forwarded a piece of user feedback: a user reported lost messages. Everyone panicked. Server, client, and quality engineers all gathered to locate the problem, and the only way to do so was for each person to check their own logs and otherwise guess blindly, which was time-consuming, laborious, and inaccurate.

Seen this way, improving the quality of the Idle Fish message service takes more than development and optimization; infrastructure is also needed to make problems easier to locate. The quality team and the development team therefore focused on building full-link troubleshooting for Idle Fish messages. The key to full-link troubleshooting is comprehensive log support, so that the complete trajectory of a message can be retrieved. We aggregate server-side message node logs, interface logs, client-side message status tracking logs, and behavior tracking logs to restore the user's behavior and the message path as fully as possible.

Log reporting

First come the core scenarios of the message link. Key nodes where problems tend to occur or surface, such as message merging, persistence to the database, rendering on screen, domain-ring synchronization, and domain-ring update, report the logs needed for troubleshooting.

Second is the log format convention. At its core, the client generates a messageId whenever it creates a message; for marketing messages pushed by the server, the messageId is generated by the server. Each time a message passes through a core node, only the state at that node is reported together with the messageId. The id is also passed through to the server and included in its log reports, so the messageId strings together a complete trace from client to server and back to client.

Last is the reporting method. The initial idea was to integrate the SLS SDK on the client for real-time reporting, but the cost of client-side changes was too high and stability would have been affected, so this was abandoned. In the end, the existing client tracking-point reporting path was reused; the tracking logs only need to be cleaned in real time afterwards, which satisfies the need for real-time log access without any client changes. Server-side log reporting uses the existing SLS reporting link.

It should be emphasized that, for privacy and storage-cost reasons, reported logs never carry the message content, only the necessary parameters and the messageId.
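The log convention above can be illustrated with a minimal Python sketch. The record fields (message_id, utdid, node, status, ts_ms), the node names, and the helper build_trace_log are hypothetical and chosen only for illustration; the article does not specify the actual log schema. The point shown is that each node reports its state keyed by the messageId, and the message body is never copied into the log.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class TraceLog:
    """One troubleshooting log entry; all field names here are hypothetical."""
    message_id: str  # generated by the client, or by the server for marketing pushes
    utdid: str       # device identifier, usable later for clustering
    node: str        # core node the message just passed, e.g. "merge" or "persist"
    status: str      # state at that node, e.g. "ok" or "error"
    ts_ms: int = field(default_factory=lambda: int(time.time() * 1000))


def build_trace_log(message: dict, utdid: str, node: str, status: str) -> str:
    """Serialize a log line that carries the messageId but never the message body."""
    entry = TraceLog(
        message_id=message["messageId"],
        utdid=utdid,
        node=node,
        status=status,
    )
    return json.dumps(asdict(entry), sort_keys=True)


# The "content" key is deliberately ignored when the log line is built.
message = {"messageId": "m-001", "content": "is this still available?"}
line = build_trace_log(message, utdid="device-42", node="persist", status="ok")
print(line)
```

Because every node emits the same small record, logs from client and server can later be joined on message_id alone.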

Real time log cleaning

For real-time log cleaning, we first subscribe to the minute-level tracking logs in TT and extract the message-related entries. The data volume is huge, however: a single message may produce dozens of tracking points. We therefore cluster these logs by messageId and utdid, reducing the data volume by a factor of dozens. Finally, the data is written back to SLS for link troubleshooting, and minute-level statistics are written to TDDL for building monitoring.
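A rough sketch of the clustering step, in plain Python rather than the real streaming job: events are grouped by (messageId, utdid) so that the dozens of tracking points for one message collapse into a single record, and a per-minute count is kept for monitoring. The event fields and the functions cluster_events and minute_stats are assumptions made for illustration; the actual TT, SLS, and TDDL integrations are not shown.

```python
from collections import defaultdict


def cluster_events(events):
    """Group raw tracking events by (messageId, utdid), keeping nodes in time order."""
    clusters = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts_ms"]):
        clusters[(e["messageId"], e["utdid"])].append((e["node"], e["status"]))
    return dict(clusters)


def minute_stats(events):
    """Count events per (minute bucket, node, status) for monitoring."""
    stats = defaultdict(int)
    for e in events:
        minute = e["ts_ms"] // 60_000  # minute bucket since epoch
        stats[(minute, e["node"], e["status"])] += 1
    return dict(stats)


events = [
    {"messageId": "m-001", "utdid": "d-1", "node": "persist", "status": "ok", "ts_ms": 60_500},
    {"messageId": "m-001", "utdid": "d-1", "node": "merge", "status": "ok", "ts_ms": 60_100},
    {"messageId": "m-002", "utdid": "d-2", "node": "merge", "status": "error", "ts_ms": 125_000},
]
clusters = cluster_events(events)
stats = minute_stats(events)
```

Here three raw events become two clustered records, one per message and device, which is the same reduction the article describes at a much larger scale.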

User behavior log

Besides following the life cycle of a single message, another side of troubleshooting is the broader view: which user behavior on the client triggered the exception.

By cleaning and scheduling the click and page-exposure tracking points reported by the client in real time, we can find out which buttons the user clicked and which pages they visited before the exception occurred, and from that work out a path to reproduce it. We also integrate the server interface call logs to check, at the time of the user's exception, whether the server interface was requested successfully, whether the request parameters were correct, and what the error code was. Integrating this information helps us reproduce and locate the specific scenario and helps developers solve problems faster.
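As a minimal sketch of this idea, the Python below selects the click and exposure events in a time window before an exception and lists the server calls in the same window that returned an error code. The event shape, the 60-second default window, the API names, and both function names are hypothetical; the article does not describe the real query logic.

```python
def behavior_before(events, exception_ts_ms, window_ms=60_000):
    """Return click and page-exposure events in the window before an exception, oldest first."""
    start = exception_ts_ms - window_ms
    picked = [
        e for e in events
        if e["type"] in ("click", "exposure") and start <= e["ts_ms"] < exception_ts_ms
    ]
    return sorted(picked, key=lambda e: e["ts_ms"])


def failed_api_calls(api_logs, exception_ts_ms, window_ms=60_000):
    """Return server interface calls in the same window that carry an error code."""
    start = exception_ts_ms - window_ms
    return [
        c for c in api_logs
        if start <= c["ts_ms"] <= exception_ts_ms and c["error_code"] is not None
    ]


events = [
    {"type": "exposure", "name": "chat_list", "ts_ms": 1_000},
    {"type": "click", "name": "open_chat", "ts_ms": 50_000},
    {"type": "click", "name": "send_button", "ts_ms": 95_000},
    {"type": "exposure", "name": "item_detail", "ts_ms": 130_000},
]
api_logs = [
    {"api": "message.send", "error_code": None, "ts_ms": 95_200},
    {"api": "message.sync", "error_code": "TIMEOUT", "ts_ms": 99_000},
]

path = [e["name"] for e in behavior_before(events, exception_ts_ms=100_000)]
failed = [c["api"] for c in failed_api_calls(api_logs, exception_ts_ms=100_000)]
```

The resulting path is a candidate reproduction sequence, and the failed calls narrow down which server interface to inspect first.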

Front-end interaction

"Before the troops move, the data goes first." The data preparation described above lets us examine possible problems in the message link in finer detail. Making those problems more visible and troubleshooting more convenient for developers is the other goal. Watching how developers investigate message problems, we found that they generally work along three dimensions: user, message ID, and session ID. We organized the queries accordingly.
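A query along these three dimensions might look like the following sketch. The field names user_id, message_id, and session_id and the function query_logs are assumptions for illustration, not the platform's real interface.

```python
def query_logs(logs, user_id=None, message_id=None, session_id=None):
    """Filter clustered logs by any combination of user, message ID, and session ID."""
    wanted = {"user_id": user_id, "message_id": message_id, "session_id": session_id}
    active = {k: v for k, v in wanted.items() if v is not None}
    if not active:
        raise ValueError("provide at least one of user_id, message_id, session_id")
    return [log for log in logs if all(log.get(k) == v for k, v in active.items())]


logs = [
    {"user_id": "u-1", "message_id": "m-001", "session_id": "s-9", "node": "merge"},
    {"user_id": "u-1", "message_id": "m-002", "session_id": "s-9", "node": "persist"},
    {"user_id": "u-2", "message_id": "m-003", "session_id": "s-7", "node": "render"},
]

by_user = query_logs(logs, user_id="u-1")
by_message = query_logs(logs, message_id="m-003")
by_user_and_session = query_logs(logs, user_id="u-1", session_id="s-7")
```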

In addition, so that users can see more intuitively where a message is at each node of the link during a query, we divided the link into three segments: client uplink, server, and client downlink. Abnormal nodes are prominently highlighted, so users can quickly see where the problem is and in which segment.
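The three-segment view can be sketched as below: the ordered node records for one message are bucketed by segment, and the first abnormal node is picked out for highlighting. The segment keys, node names, and status values are hypothetical; only the three-way split itself comes from the article.

```python
STAGES = ("client_uplink", "server", "client_downlink")


def group_by_stage(trace):
    """Bucket the ordered node records of one message into the three segments."""
    grouped = {stage: [] for stage in STAGES}
    for record in trace:
        grouped[record["stage"]].append(record["node"])
    return grouped


def first_abnormal(trace):
    """Return (stage, node) of the first record whose status is not "ok", or None."""
    for record in trace:
        if record["status"] != "ok":
            return record["stage"], record["node"]
    return None


trace = [
    {"stage": "client_uplink", "node": "send", "status": "ok"},
    {"stage": "server", "node": "store", "status": "ok"},
    {"stage": "client_downlink", "node": "merge", "status": "error"},
    {"stage": "client_downlink", "node": "render", "status": "missing"},
]

grouped = group_by_stage(trace)
problem = first_abnormal(trace)
```

Reporting only the first abnormal node is a design choice in this sketch: later failures in the same link are often consequences of the earlier one.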

Summary and outlook

With the link troubleshooting tool of the Idle Fish message quality platform, we can now clearly see the complete life cycle of a message and trace the problem path from the logs of abnormal user behavior, helping developers quickly locate which segment may be at fault. We no longer have to manually search multiple databases and log streams and integrate the results, or even wait until the next day to get the data, and troubleshooting efficiency has improved by more than 90%. The approach is also reusable: for example, troubleshooting for the item-publishing link is being onboarded to the platform. Beyond this, the message quality platform has done a lot of work on the efficiency of problem discovery and testing, such as on-device inspection and reconciliation, real-time monitoring of core metrics, user feedback governance, and link-level regression testing tools, which will be introduced in detail later. Combining automation with on-device intelligence is also something we are working on. We hope that, with us standing guard, Idle Fish messaging becomes more and more stable.

Can't stay idle? Come to Idle Fish!

PICK ME

The Idle Fish technology team pursues greater value through innovation and drives business change.

From the idle-goods business to the used-goods business, to building "worry-free shopping", "hobby communities", and "new offline";

from publishing books and speaking at summits to open source, patents, and overseas exchanges;

can't stay idle, come to Idle Fish: the technical team's exploration and deep pursuit of excellence is our foundation.

Join now

1. Hiring: client / server / front-end / architecture / quality engineers

2. Send your resume to guicai.gxy@alibaba-inc.com

3. You can also find us on Toutiao, Zhihu, Juejin, Facebook, and Twitter
