手机推荐排行榜,盛世嫡妃凤轻小说,完美世界有声小说

新聞中心

這里有您想知道的互聯(lián)網(wǎng)營銷解決方案

圍觀 Roblox 持續(xù)三天的故障

Roblox 是一家游戲公司，也是元宇宙概念股。去年底發(fā)生一起故障，持續(xù)三天之久，官網(wǎng)也發(fā)布blog[1]總結(jié)了原因，但并沒有說清楚底層 boltdb的問題。由于需要 FQ, 同時把官方 blog 復(fù)制了一份，歡迎圍觀 https://mytechshares.com/2022/02/16/roblox-return-to-service/

故障原因

一句話總結(jié)：Roblox 使用 consul mesh 做服務(wù)治理框架，故障前逐步放量啟用 stream 功能，consul leader 選舉和數(shù)據(jù)同步存儲使用 boltdb, 放量后恰巧觸發(fā)了 boltdb 性能問題

原因看似簡單，但是大型互聯(lián)網(wǎng)架構(gòu)本身很復(fù)雜，出問題很難第一時間定位原因，無論業(yè)務(wù)還是基礎(chǔ)架構(gòu)。同時 Roblox 使用 consul 企業(yè)版對于公司是黑盒，boltdb 問題也是事后由 hashicorp 公司定位

Boltdb 介紹

Boltdb 性能問題參考 HN 的討論[2]，修復(fù)請參考阿里巴巴的 Segregated Hashmap Patch[3]

Boltdb 是 LMDB 的移值實現(xiàn)，單機(jī)存儲引擎里算是比較挫的，目前出名開源軟件只有 etcd 在使用，并且原作者 go boltdb[4] 己經(jīng)廢棄 deprecated, 請使用 etcd 的 bboltdb[5]

etcd 之所以采用 boltdb, 主要是看中了 boltdb 是 B+tree 實現(xiàn)，支持完整的 ACID, 支持事務(wù)。從上圖可以看到，性能壓測除了 Query 50M values 全是最低的。關(guān)于 boltdb 源碼分析推薦老C的 boltdb 源碼分析系列[6]，我是他二十年老粉

性能問題

Boltdb 只有一個文件，使用 Mmap syscall 映射到內(nèi)存，并沒有使用內(nèi)存池來管理。磁盤文件以頁 Page(4KB) 劃分?jǐn)?shù)據(jù)單元，當(dāng)頁不在使用時，并不會釋放磁盤空間，而是使用 freelist 維護(hù)空閑列表，供后續(xù)使用

上圖是 Roblox 公司的數(shù)據(jù)統(tǒng)計，4G 大小的文件，大部分都是空閑的并未使用。性能問題就在于每次 B+tree 調(diào)整，分配，釋放頁時，arrayAllocate[7]算法復(fù)雜度都是 O(n)

func (f *freelist) arrayAllocate(txid txid, n int) pgid {
  ......
 var initial, previd pgid
 for i, id := range f.ids {
    ......
  // Reset initial page if this is not contiguous.
  if previd == 0 || id-previd != 1 {
   initial = id
  }

  // If we found a contiguous block then remove it and return it.
  if (id-initial)+1 == pgid(n) {
   // If we're allocating off the beginning then take the fast path
   // and just adjust the existing slice. This will use extra memory
   // temporarily but the append() in free() will realloc the slice
   // as is necessary.
   if (i + 1) == n {
    f.ids = f.ids[i+1:]
   } else {
    copy(f.ids[i-n+1:], f.ids[i+1:])
    f.ids = f.ids[:len(f.ids)-n]
   }

   // Remove from the free cache.
   for i := pgid(0); i < pgid(n); i++ {
    delete(f.cache, initial+i)
   }
   f.allocs[initial] = txid
   return initial
  }
  previd = id
 }
 return 0
}

該代碼用于從 freelist 列表 ids 數(shù)組中，尋找連續(xù)空閑的 n 頁 Page, 當(dāng) boltdb 中有大量空閑頁且不滿足需求時，都會線性遍歷全部列表，即使找到了，也要有大量的數(shù)組移動，性能影響很大

2019 年時阿里巴巴的 chenxingyu 同學(xué)調(diào)研并提交了修復(fù) segregated hashmap patch[8], 使得 boltdb 文件大小增長到 100G 也無性能衰減，Roblox 最后修復(fù)故障也是使用了這個 patch

新算法原理也很簡單，大家刷 leetcode 都知道 TwoSum 算法吧?本質(zhì)也是空間換時間，通過構(gòu)建多個 map, 索引空閑列表長度與 Page id 對應(yīng)關(guān)系。使得復(fù)雜度由 O(n) 變成 O(1)

// pidSet holds the set of starting pgids which have the same span size
type pidSet map[pgid]struct{}

type freelist struct {
  ...
    freemaps map[uint64]pidSet // key is the size of continuous pages(span), value is a set which contains the starting pgids of same size
    forwardMap map[pgid]uint64 // key is start pgid, value is its span size
    backwardMap map[pgid]uint64 // key is end pgid, value is its span size
    ...
}

主要是增加三個 Map

freemaps key 是連續(xù)空閑頁的長度，value 是 Page id 集合
forwardMap 前向 map, key 是 start pgid, value 是連續(xù)空閑長度，比如 pgid 101, value 是 3, 那么代表 [101,102,103] 均空閑
backwardMap 后向 map, 道理一樣，key 是 end pgid, value 是連續(xù)長度，比如 pgid 101, value 是 3, 那么代表 [99,100,101] 均空閑

// mergeWithExistingSpan merges pid to the existing free spans, try to merge it backward and forward
func (f *freelist) mergeWithExistingSpan(pid pgid) {
 prev := pid - 1
 next := pid + 1

 preSize, mergeWithPrev := f.backwardMap[prev]
 nextSize, mergeWithNext := f.forwardMap[next]
 newStart := pid
 newSize := uint64(1)

 if mergeWithPrev {
  //merge with previous span
  start := prev + 1 - pgid(preSize)
  f.delSpan(start, preSize)

  newStart -= pgid(preSize)
  newSize += preSize
 }

 if mergeWithNext {
  // merge with next span
  f.delSpan(next, nextSize)
  newSize += nextSize
 }

 f.addSpan(newStart, newSize)
}

mergeWithExistingSpan 用于盡可能合并空閑頁 pid, 代碼很簡單，分別查找 backwardMap, forwardMap 如果相鄰，那么就盡可能合并

func (f *freelist) addSpan(start pgid, size uint64) {
 f.backwardMap[start-1+pgid(size)] = size
 f.forwardMap[start] = size
 if _, ok := f.freemaps[size]; !ok {
  f.freemaps[size] = make(map[pgid]struct{})
 }

 f.freemaps[size][start] = struct{}{}
}

func (f *freelist) delSpan(start pgid, size uint64) {
 delete(f.forwardMap, start)
 delete(f.backwardMap, start+pgid(size-1))
 delete(f.freemaps[size], start)
 if len(f.freemaps[size]) == 0 {
  delete(f.freemaps, size)
 }
}

addSpan, delSpan 復(fù)雜度均為 O(1)

// hashmapAllocate serves the same purpose as arrayAllocate, but use hashmap as backend
func (f *freelist) hashmapAllocate(txid txid, n int) pgid {
  ......
 // if we have a exact size match just return short path
 if bm, ok := f.freemaps[uint64(n)]; ok {
  for pid := range bm {
   // remove the span
   f.delSpan(pid, uint64(n))

   f.allocs[pid] = txid

   for i := pgid(0); i < pgid(n); i++ {
    delete(f.cache, pid+i)
   }
   return pid
  }
 }

 // lookup the map to find larger span
 for size, bm := range f.freemaps {
  if size < uint64(n) {
   continue
  }

  for pid := range bm {
   // remove the initial
   f.delSpan(pid, size)

   f.allocs[pid] = txid

   remain := size - uint64(n)

   // add remain span
   f.addSpan(pid+pgid(n), remain)

   for i := pgid(0); i < pgid(n); i++ {
    delete(f.cache, pid+i)
   }
   return pid
  }
 }

 return 0
}

hashmapAllocate 用于申請連續(xù)空閑的 n 頁，復(fù)雜度是 O(1), 也有可能退化成 O(n) 去遍歷 freemaps, 從大于 n 頁的列表中申請

性能測試來自 alibaba blog[9]

評價一下

墨菲定律：如果事情有變壞的可能，不管這種可能性有多小，它總會發(fā)生。Roblox 故障持續(xù)三天，說明大規(guī)模互聯(lián)網(wǎng) IT 建設(shè)本身非常難，平時就需要做好演練，一切可能成為瓶頸的組件，一定會出問題

Etcd 2019 年這個問題就 FIX 了，但是 consul 并沒有 port 修復(fù)，基礎(chǔ)架構(gòu)從業(yè)者還是要多關(guān)注 bug fix

服務(wù)發(fā)現(xiàn)是最核心的組件，無論使用 etcd, 還是 consul mesh, 本質(zhì)還是一個 CP 系統(tǒng)，使用 raft 來確保數(shù)據(jù)一致性，但我們真的要求強(qiáng)一致嘛?答案肯定是否定的，國內(nèi)大公司很少直接用，更多的是強(qiáng)調(diào)可用性

可觀測性，隨隨變變上百個微服務(wù)，如果沒有構(gòu)建好監(jiān)控報警系統(tǒng)，出故障很難定位問題。大家在做項目時也要注意這一點(diǎn)

當(dāng)前題目：圍觀 Roblox 持續(xù)三天的故障
當(dāng)前鏈接：http://m.fisionsoft.com.cn/article/dpisicg.html

新聞中心

故障原因

Boltdb 介紹

性能問題

評價一下

其他資訊