Make Service Faults Transparent

This article is in English because I really need to practice the language. Sorry if it is hard to understand.

A Summary of What Has Been Happening Recently

Recently, IT service on my campus has been very unstable.

  • In March, many people posted on forums that they had tried to top up their campus Internet accounts via WeChat, but were credited with far more money (maybe 100x) than they had paid.
    • Later the WeChat top-up service was disabled. Because most people were not aware of the existing offline top-up-by-card service, many of their accounts fell into arrears.
    • Several days later, the campus Internet billing system was disabled, which meant you could use the network for free. Later billing was resumed, but only for the monthly fee (not the traffic fee).
    • An easy-to-miss statement was published at that point, saying that the problem was caused by a bug from the software company.
  • On March 20th, campus card users who used their cards to get hot water or buy breakfast found their cards locked. (The lazy ones were not affected at all.)
    • That morning nobody knew whether the issue was being solved, until at around 11 (lunchtime) my school's instructor sent an announcement that "there will be unlock service in canteens, please keep order and don't panic at the scene". Notices from the canteen administrators were put up in the canteens. Unlocking was quick and easy, but most people still went to canteens where Alipay is accepted.
    • Later that afternoon a public statement from the card administrator came out: it was a service fault (on BITUnion some said it was a bug that had been hidden for 14 years). IT staff explained on BITUnion that they had tried to work out solutions and mitigate the issue before drafting public statements.
  • In recent months the campus Internet has been unstable: during peak hours it becomes very slow or even unavailable. Maybe that is around 2% downtime (measured over a 24-hour day, roughly half an hour), which does not look like much, but users certainly notice it.
    • The causes seem very complex. In my view, new DNS servers, old cache servers, new firewall systems, new upstream link providers and upstream link issues can all cause problems. And of course all those new facilities need to be fine-tuned, which takes time.
    • So far no official statement has been published. But the IT service monthly report (which most people are not aware of) said "Issue fully fixed, during peak hours upstream links can work in full bandwidth now". One of the reasons it mentioned was "DDoS attack causing network core server CPU instant usage up to 99% (usually ~20%)".
    • However, a student representatives' meeting will be held soon, and many representatives will raise the hot Internet issue there. But I believe most of them will never understand why this is happening.

"Totally nailed the fix"

Why Faults Need to Be Transparent

As you can see, all these issues appeared suddenly, but they did not happen for no reason. Apart from solving the issues, making the solving process transparent is also important. Why?

Because information technology is becoming essential to our lives, just like water and electricity supplies. At this point it is no longer anything "advanced", so people have high expectations of it. What's more, IT is developing fast (measured in years, not decades), so people's expectations grow fast along with it.

It's quite a challenge for campus IT service to keep up with that. But first of all, they are working on it. If they don't speak up, people who consider the service essential will think, "It's just messing up my life, and they aren't even trying hard to solve it". This is surely a gap in understanding between the two sides.

"Why did you leave the escalator unfixed for ONE MONTH!"

P.S. A kind person reminded me that in the "old system" there are sometimes staff who do not work at all. But I guess on my campus they work hard.

Another problem when IT service is not transparent "in time" is that users don't know whether they need to report an issue or just wait. Of course most of us will silently wait for the fix - most of us are busy, right? But what if the staff don't know about the issue at all? We don't know whether they know, and most people won't keep trusting forever that "they must be fixing it now". This can be an even more misleading situation, and it causes user dissatisfaction.

I can't think of any disadvantage, for a hard-working public service, of being actively transparent about faults, so I strongly believe in this idea.

Ah, yes, I have to highlight that what I mean by transparency here is "instant transparency". Sometimes this brings one problem: when you realize that a cause you published earlier was wrong, you have to retract the previous statement, which brings confusion. If everybody is wise and accepts that people can make mistakes, this is not a problem at all, and you can just leave your previous "wrong" statement there.

In Staytus's demo, an issue went back to red from the Monitoring status

Tools and Platforms Are Not That Important

People may argue, "We might not have the right tool for that right now". The tool may indeed not fit, but once you have the will to do the right thing, tools and platforms are not a problem.

A good example on my campus is the student financial service. They always use forums to answer students' scholarship questions. The forum they chose is not that popular, and I guess some scholarship process information could be presented more nicely on a single page, but the point is that they chose to be transparent.

IT service, by contrast, is:

  • Essential, so users need feedback more quickly;
  • Wide, so physical service and on-site announcements in all areas are expensive;
  • Complex, where hardware, software and configuration all matter.

Thus a digital way might be a better way to provide transparency.

But what if "the digital way" is faulty? We can put the solution on a school server that hardly ever fails (probably standalone) and is connected to both the Intranet and the Internet. A better solution might be to prepare for the worst: choose a third party (a VPS outside the Intranet) or a public service (Weibo or WeChat), and hope that it won't fail when our infrastructure fails. Unreliable as it seems, you are winning a lottery if everything fails at once (maybe once in a lifetime?), and in that case you won't hesitate to make physical announcements.
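As a minimal sketch of this "prepare for the worst" idea: a link or client could try the self-hosted page first and fall back to the external copy. The mirror URLs and the function names below are my own assumptions for illustration, not anything the campus actually runs.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# Hypothetical mirror URLs, ordered by preference (illustration only).
MIRRORS = (
    "https://status.campus.example",   # self-hosted standalone server
    "https://status-backup.example",   # external VPS outside the Intranet
)


def pick_status_page(mirrors: Sequence[str],
                     is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first mirror the probe reports as reachable, else None."""
    for url in mirrors:
        if is_up(url):
            return url
    return None  # everything failed: time for physical announcements


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Probe a URL with a plain HTTP GET; any error counts as 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        return False
```

In real use you would call `pick_status_page(MIRRORS, http_probe)`. The probe is passed in as a parameter so that the selection logic can be tried without any network access.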

Yeah, maybe your physical announcement is not enough...

A Blueprint Specifically for IT Service

When everybody is busy, this kind of customer service cannot depend only on "I contact you and you talk to me". Some self-service thinking can be incorporated here: make status updates available to everyone. When people need help, they can check the updates, rest assured, and calmly wait.

I heard that a support ticket system for IT services is being considered, but right now the "status page" thing is more important.

We have talked about the platforms, right? Let's look at them one by one.

  • A webpage, which is very customizable, seems good. But whether it is opened in a browser or inside the WeChat WebView, it can't push notifications by itself.
    • However, when users hit an issue that matters to them, they will check the status themselves. So pushing doesn't matter that much.
    • When we have had a disaster and need to "push" an apology, it doesn't need to be instant or frequent. That is outside the scope of what we are talking about.
  • Weibo seems good and can be a choice. But there are two problems: it is so public that sometimes that is a drawback, and, last but not least, when everybody uses WeChat, who cares about Weibo?
  • The problem with a WeChat official account is that it can't push messages that frequently. When you have limits, you might not want to be that transparent. And yes, users don't want to receive messages that frequently.
  • A WeChat enterprise account doesn't seem to have these problems. It doesn't limit your push frequency. But if you choose this, remember that it is not a long-term solution (it is being superseded by the Enterprise WeChat app) and that it is not supported on PC or Windows Phone. It hardly deserves to be called "transparent" unless you provide a webpage alternative.
  • As for other push methods: people hardly use email, SMS is expensive, and you are probably thinking of a mobile app? Nobody wants something that heavy.

As I said above, when a fault happens, users are motivated to "check status". Thus frequent, up-to-date status updates that don't need to be pushed to everybody look good.

The conclusion is that it's best to have

  • a self-hosted standalone status webpage,
  • linked from major IT platforms (in my campus's case, the WeChat enterprise account and the IT department website),
  • which can be quickly deployed to an external VPS and keep working if the self-hosted one crashes,
  • whose data can be consumed via Webhooks or an API by other official platforms, such as Weibo or WeChat.
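The last point, letting other platforms consume the status data, could look roughly like the sketch below. The field names, the JSON shape and the post format are all assumptions of mine for illustration; real tools such as Cachet or Staytus define their own formats.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StatusUpdate:
    """One status update; the field names are hypothetical."""
    service: str
    state: str      # e.g. "investigating", "identified", "monitoring", "resolved"
    message: str
    posted_at: str  # ISO 8601 timestamp


def to_webhook_body(update: StatusUpdate) -> str:
    """Serialize an update as the JSON body a webhook would deliver."""
    return json.dumps(asdict(update), ensure_ascii=False, sort_keys=True)


def to_short_post(body: str, status_url: str) -> str:
    """Turn a received webhook body into a short post for Weibo, WeChat, etc."""
    data = json.loads(body)
    return f"[{data['state'].upper()}] {data['service']}: {data['message']} ({status_url})"
```

The status page stays the single source of truth, and every other platform only republishes a short line that links back to it.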

Of course this has some technical cost, so choosing an existing public service (in the short term) is fine, too.

"How to publish" is easy: we can prepare some statement templates (like the well-known investigating/identified/monitoring/resolved model) and add details to them when they are used. We can also set rules for updates to keep things transparent, such as publishing at least one update every X hours.

Pre-translated templates in Google's statusboard; notice the "we have additional English explanation" sentences
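The statement templates and the "at least one update every X hours" rule could be sketched like this. The template wording and the function names are my own assumptions; only the four state names come from the well-known model mentioned above.

```python
from datetime import datetime, timedelta

# Hypothetical wording for each state of the investigating/identified/
# monitoring/resolved model; a real team would write its own templates.
TEMPLATES = {
    "investigating": "We are investigating reports of problems with {service}. {details}",
    "identified": "We have identified the cause of the problem with {service}. {details}",
    "monitoring": "A fix for {service} has been applied and we are monitoring the result. {details}",
    "resolved": "The problem with {service} has been resolved. {details}",
}


def render(state: str, service: str, details: str = "") -> str:
    """Fill in the template for a state; unknown states are rejected."""
    if state not in TEMPLATES:
        raise ValueError(f"unknown state: {state}")
    return TEMPLATES[state].format(service=service, details=details).strip()


def update_overdue(state: str, last_update: datetime,
                   now: datetime, max_gap: timedelta) -> bool:
    """True if an open incident has gone longer than max_gap without an update."""
    return state != "resolved" and now - last_update > max_gap
```

For example, `render("identified", "campus card", "Unlock service is available in canteens.")` produces a ready-to-publish statement, and `update_overdue` can drive a reminder to whoever is responsible for publishing.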

We also need someone to publish messages (I know in China this is a bigger problem). A good technical writer should be recruited. But I think it can also be done as a part-time job by students: they sign a confidentiality agreement and join the working discussion group, and if any fault happens, they are responsible for publishing the situation according to the template and the discussion group's conversations. Yeah, I bet these conversations sometimes contain passwords or other sensitive things, so confidentiality is important.

Or if the tech staff can do the updates themselves, that's fine (but they are really too busy for that).

"The well-known model" in Staytus

Choosing an Open-Source Solution

As a student who doesn't have much money, I like open source a lot. For this status page thing, of course I would like to solve it with open-source software.

Actually, as far as I know, there is no such "status page service" in China. For example, LeanCloud built its status page itself. The "international" cloud versions don't seem suitable here, because they might be very slow. So we have to count on self-hosted, open-source ones.

In my opinion, for the status page of a school IT service, the most important thing is the "updates". The overall "status indicator" is not that important.

Yes, this Apple style doesn't fit

After some research, I found that there are not many open-source status solutions that are dynamic, usable and still maintained.

  • Cachet, the most popular one but not perfect for now, built with PHP and MySQL/PostgreSQL (as a reminder, try the dev version; the current stable version doesn't have status updates)
  • Staytus, already elegant and ready to use (but simple), built with Ruby and MySQL, and the demo is really pretty
  • statuspage, not that popular, and I haven't checked it thoroughly yet, but as a Python alternative it says "Cachet is a great product, I simply despise PHP"

Cachet (dev version)

I know some of you hate databases. Using a static page generator is a good idea. Such solutions exist, but they don't seem that polished, and building the workflow around them is hard work.

  • Netlify StatusKit, though it's "a template to deploy your own Status pages on Netlify", it seems to be a generator
  • or we can make it with Jekyll and customized themes and plugins

I hope these solutions are helpful. Still, the most important thing is what you are trying to achieve.
