Two virtual machine failures on a vSphere platform, and how they were handled

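On the morning of 2013.12.10, the web page served by one of our server VMs started showing a database error (had it not been for this, I would have been on a trip to Meizhou on the 11th and 12th). I logged in over ssh and found that mysqld was already dead, and service mysqld restart could not bring the MySQL service back up. So I packed up the database files and the web application directory as they were and downloaded them to my local machine.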

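After I rebooted the VM, the system would not come up. It stopped at a prompt mentioning Ctrl+D and demanded that the disk be repaired. I ran fsck -fy /, and the repair output was full of entries being cleared; my heart sank. Once the repair finished and I got into the system, the network interface would not start, mysqld would not start, and a large amount of data was gone: usage of the root partition had dropped from 83% to only 30%, and a partition that held backup data had disappeared as well. (I had not been managing this server before. I only learned the password last Friday, 2013.12.06, just before leaving work, and unfortunately did not log in to take a look, otherwise things might still have been salvageable. That partition already seemed to be missing before the reboot.)

Later I looked at /var/log/messages.4 and saw that disk errors had been reported as early as 11.10, for example kernel: sda1: rw=0, want=14014321424, limit=58701447. The fault may have started even earlier, because the messages logs are only kept for at most five weeks. I tried booting again and was asked to repair the disk again; after this repair I could no longer log in to the server with the password. I reset the password in single-user mode, but after a reboot I still could not log in.

The only data backup for this server was a full backup from 45 days earlier. After the failure we tried several ways of repairing the disk without success, and several ways of recovering the database, such as mysqlbinlog and mysqlcheck. But several of the mysqlbin_log* files turned out to be missing too, and the most recent mysql.bin.log was corrupt, so mysqlbinlog could not be used to export an sql file.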
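To see quickly which binary logs are still usable before attempting a replay, a loop like the one below can help. This is only a minimal sketch: check_binlogs is a hypothetical helper name, and the path in the usage comment is an assumed example, not the layout of the failed server. It relies only on mysqlbinlog returning a non-zero exit status when it cannot read a file.

```shell
#!/bin/sh
# Hypothetical helper: report which binary log files mysqlbinlog can still parse.
# Usage (the path is an assumed example): check_binlogs /var/lib/mysql/mysql-bin.0*
check_binlogs() {
    for f in "$@"; do
        # Discard the decoded SQL; only the exit status matters here.
        if mysqlbinlog "$f" > /dev/null 2>&1; then
            echo "OK  $f"
        else
            echo "BAD $f"
        fi
    done
}
```

Files reported as OK can then be decoded one at a time into sql files; a BAD file in the middle of the sequence means the changes after it cannot be replayed reliably.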

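After an all-nighter spent backing up the VM files, attaching them to another VM, and trying in vain to recover the database, we decided that the only option was to restore from the 45-day-old data. The full backup was kept at the company of another colleague, B (who is actually a programmer at an outsourcing firm), so we had to wait for him to arrive before we could start uploading the data to a new VM. His database backup had been exported with phpadmin, and importing it takes two to three hours. What a trap. We had in fact already imported this sql backup once: my other colleague A (who had been managing the server all along) then merged in the data from the mysqlbinlog logs, only to find that the database was full of garbled characters and unusable. After spending three hours on that import, A had not re-exported a copy with mysqldump. Had he done so, exporting and re-importing the already processed sql would have taken only a few minutes.

At around 7 pm we decided to import the 45-day-old sql backup again. The next afternoon I had to go to Haizhu District to teach the use of the "information registration platform and information collection software" at a training class for the movable cultural relics census, run by the Haizhu District cultural relics census office, so I left at around 9:30 pm. I could hardly turn up to teach after two days without a shower. I got away just in time; my two colleagues kept at it and only finished after midnight. (Strictly speaking I could have stayed out of this failure altogether, but since I will be the one managing the server from now on, I had to see it through with them.)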

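On the morning of 2013.12.11, colleague A arranged for the 45 days of missing data on the server to be re-entered. In the afternoon I was away teaching.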

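On the third day (2013.12.12), at about 2:49 pm, I logged in to another machine, the map server, intending to back up its data. I even went into mysql and ran a show on the databases.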

Something came up and I stepped away for a few minutes. When I came back and went in over ssh again, I found this:

[root@test local]# 
Message from syslogd@ at Thu Dec 12 14:51:52 2013 ...
test kernel: journal commit I/O error
Last login: Thu Dec 12 14:49:05 2013 from 121.32.150.241
-bash: g: command not found
-bash: filex-lport: command not found
-bash: ncconfig: command not found
-bash: ncconfig: command not found
-bash: unify-adapter: command not found
-bash: unify-adapter: command not found
-bash: wilkenlistener: command not found
-bash: wilkenlistener: command not found
-bash: childkey-notif: command not found
-bash: childkey-notif: command not found
-bash: childkey-ctrl: command not found
-bash: childkey-ctrl: command not found
-bash: elad: command not found
-bash: elad: command not found
-bash: o2server-port: command not found
-bash: o2server-port: command not found
-bash: b-novative-ls: command not found
-bash: b-novative-ls: command not found
-bash: metaagent: command not found
-bash: metaagent: command not found
-bash: cymtec-port: command not found
-bash: cymtec-port: command not found
-bash: mc2studios: command not found
-bash: mc2studios: command not found
-bash: ssdp: command not found
-bash: ssdp: command not found
-bash: fjicl-tep-a: command not found
-bash: fjicl-tep-a: command not found
-bash: fjicl-tep-b: command not found
-bash: fjicl-tep-b: command not found
-bash: linkname: command not found
-bash: linkname: command not found
-bash: fjicl-tep-c: command not found
-bash: fjicl-tep-c: command not found
-bash: sugp: command not found
-bash: sugp: command not found
-bash: tpmd: command not found
-bash: tpmd: command not found
-bash: intrastar: command not found
-bash: intrastar: command not found
-bash: dawn: command not found
-bash: dawn: command not found
-bash: global-wlink: command not found
-bash: global-wlink: command not found
-bash: ultrabac: command not found
-bash: ultrabac: command not found
-bash: rhp-iibp: command not found
-bash: rhp-iibp: command not found
-bash: armadp: command not found
-bash: armadp: command not found
-bash: elm-momentum: command not found
-bash: elm-momentum: command not found
-bash: facelink: command not found
-bash: facelink: command not found
-bash: persona: command not found
-bash: persona: command not found
-bash: noagent: command not found
-bash: noagent: command not found
-bash: can-nds: command not found
-bash: can-nds: command not found
-bash: can-dch: command not found
-bash: can-dch: command not found
-bash: can-ferret: command not found
-bash: can-ferret: command not found
-bash: noadmin: command not found
-bash: noadmin: command not found
-bash: tapestry: command not found
-bash: tapestry: command not found
-bash: spice: command not found
-bash: spice: command not found
-bash: xiip: command not found
-bash: xiip: command not found
-bash: discovery-port: command not found
-bash: discovery-port: command not found
-bash: egs: command not found
-bash: egs: command not found
-bash: videte-cipc: command not found
-bash: videte-cipc: command not found
-bash: ia32e-redhat-linux: command not found
-bash: dm: command not found
-bash: uis: command not found
-bash: uis: command not found
-bash: synotics-relay: command not found
-bash: synotics-relay: command not found

The disk was reporting errors.

A look at /var/log/messages:

Dec 12 14:49:01 test kernel: sda1: rw=0, want=14014321424, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=6633534320, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=6999859520, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=5257640256, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=5945582256, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=10245271912, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=10240567928, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec 12 14:49:01 test kernel: sda1: rw=0, want=5945582256, limit=58701447
Dec 12 14:49:01 test kernel: attempt to access beyond end of device
Dec  8 04:02:02 test syslogd 1.4.1: restart.
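Errors of this kind had been sitting in the logs of the first server for about a month before anyone noticed. A small check like the following, run regularly (for example from cron), could have raised the alarm earlier. It is a minimal sketch: the function names are hypothetical, and it only matches the two message patterns shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helpers for spotting the kernel errors shown above in a syslog file.

# Count "attempt to access beyond end of device" lines.
# grep -c prints 0 and exits 1 when nothing matches, hence the "|| true".
count_disk_errors() {
    grep -c 'attempt to access beyond end of device' "$1" || true
}

# Print the lines where the requested sector (want=) lies past the
# end of the device (limit=).
list_out_of_range() {
    awk '/want=[0-9]+, limit=[0-9]+/ {
        w = $0; sub(/.*want=/, "", w); sub(/,.*/, "", w)   # number after want=
        l = $0; sub(/.*limit=/, "", l)                     # number after limit=
        if (w + 0 > l + 0) print
    }' "$1"
}
```

For example, calling count_disk_errors /var/log/messages and sending a mail whenever the result is not 0 would be enough for a first warning.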

Going in through the console, the whole screen was full of errors.

 

From the vCenter console I found that the mysqld service had hung, with mysqld using 95.7% of the CPU.

This time I did not reboot the VM. I decided to bring in professionals to help deal with it.

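On the morning of 2013.12.13, we brought in an engineer from the Guangdong provincial Linux security center and the technical engineer from our server supplier to help with the investigation. At a meeting at around 3:30 pm it was agreed that the VM showed no signs of intrusion and no human error, and that neither the VMware virtualization platform nor the storage was at fault. An automatic backup plan was proposed as the next step. The supplier's deployment engineer and I were to go to the telecom data center to check on site that there was no hardware fault, and the programmers working on the "continuing education platform" were told to back up that platform's data as well.

Having confirmed that the VMware platform and the storage were fine, we got back to the office from the data center a little after 6 pm and redeployed the failed map server. This one was a trap as well: there was a backup from around 10.15, but the map annotation data that several staff had spent a month of overtime producing was not in it. We only finished after midnight.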

In the middle of all this I hurriedly backed up the cultural center OA system running on the same platform:

mysqldump --single-transaction -uroot -p livingwork > 2013whg_oa_mysql_livingwork.sql
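Since the meeting called for an automatic backup plan, the manual dump above could be turned into a scheduled job along these lines. This is only a sketch under assumptions: BACKUP_DIR, KEEP_DAYS and the function names are made up for illustration, and mysqldump is assumed to read its credentials from the [client] section of ~/.my.cnf rather than from -u/-p on the command line.

```shell
#!/bin/sh
# Sketch of a dated dump with simple rotation.
# BACKUP_DIR and KEEP_DAYS are assumed values, not taken from the real setup.
BACKUP_DIR=${BACKUP_DIR:-/backup/mysql}
KEEP_DAYS=${KEEP_DAYS:-14}

# Build a dated file name, e.g. livingwork_2013-12-13.sql
backup_name() {
    printf '%s_%s.sql\n' "$1" "$2"
}

# Dump one database to a temporary file and keep it only if mysqldump succeeded.
dump_db() {
    db=$1
    out="$BACKUP_DIR/$(backup_name "$db" "$(date +%Y-%m-%d)")"
    mysqldump --single-transaction "$db" > "$out.tmp" || { rm -f "$out.tmp"; return 1; }
    mv "$out.tmp" "$out"
}

# Delete dumps older than KEEP_DAYS days.
prune_old() {
    find "$BACKUP_DIR" -type f -name '*.sql' -mtime +"$KEEP_DAYS" -exec rm -f {} +
}
```

A cron entry would then call dump_db livingwork followed by prune_old. The dumps should also be copied to a different machine, since in both failures here the backups on the same disk were lost together with the data.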

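I packed up the web application and the data files, backed them up, and set up a VM locally. There I ran into a MySQL case-sensitivity problem, which I dealt with by editing my.cnf and adding one line under [mysqld]: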

lower_case_table_names = 0

Then I reset the OA administrator password:

mysql> update n_user set PASSWORD='iu+wbEJuB6CmcaHiSItIWNaUpzA=' where user_name='admin'; -- reset the admin password to 000

Tested OK.

 

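I went home at daybreak. The few modified web pages that were still missing were on a computer at B's company, so B took charge of updating them. A and B carried on, and I made my exit.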

 
