I am running code on Baobab. Today, all of a sudden, I was unable to cd into the directory in my account where my code lives. When I run ls in my home directory, I get the following message:
ls: cannot access “directory name”: Communication error on send.
When I try to download some files from this directory on Baobab to my local machine, I get the same error: Communication error on send.
Can you advise me what to do? I am in real panic mode, since I don't have a copy of my code anywhere else.
I have got the same issue. Two of my collaborators got it as well.
As a consolation, I suspect the data is just fine, since I happen to have one shell that was already in the home directory, and in there I can view and even edit files. But I cannot do “cd $PWD”, which is funny.
It is indeed unfortunate, and it seems to be quite a critical issue, likely affecting all users and potentially breaking quite a few jobs.
But who knows.
One of the beegfs-meta servers segfaulted. I have restarted it (more investigation tomorrow during working hours), and everything has been back to normal since 21:06. Sorry for the inconvenience.
Two notes:
1. It was a software error, thus the files were safe. Moreover, given that we were talking about ${HOME} space, there are daily backups for everything except the ${HOME}/scratch symlink.
2. Not all the users were affected, but only those who were accessing files stored on the crashed server. However, there is no way to know in advance who is affected, given that ${HOME} (and ${SCRATCH} as well) are on BeeGFS, which is a distributed filesystem.
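Although this cannot be predicted in advance, a user can check after the fact which of their own paths are currently unreachable. Below is a minimal sketch, assuming Python 3 is available on the login node. `probe_paths` is a hypothetical helper written for illustration, not a BeeGFS or Baobab tool. On Linux, the message "Communication error on send" is, as far as I know, the text for the `ECOMM` error code, so that is the name one would expect to see reported during such an incident. Note that a call to `os.stat` may block for a while if the filesystem is hanging rather than failing.

```python
import errno
import os


def probe_paths(paths):
    """Map each path to None if os.stat succeeds, else to the errno name."""
    results = {}
    for path in paths:
        try:
            os.stat(path)
            results[path] = None
        except OSError as exc:
            # errno.errorcode maps numeric codes to names such as "ENOENT"
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
    return results


if __name__ == "__main__":
    home = os.path.expanduser("~")
    try:
        entries = [os.path.join(home, name) for name in os.listdir(home)]
    except OSError as exc:
        # Listing the directory itself can fail during an outage
        entries = []
        print(f"cannot list {home}: {exc.strerror}")
    for path, err in probe_paths(entries).items():
        if err is not None:
            print(f"{path}: {err}")
```

Run as a script, it prints only the entries of the home directory that cannot be accessed, together with the error name.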
BeeGFS is a distributed storage system: users' files are spread across various servers, and big files are stored in chunks across various servers. It means that as soon as you try to access a file, or a chunk of a file, on the faulty server, you'll face the issue. Sooner or later, everyone ends up accessing all the Baobab storage servers.
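To illustrate why only some files are hit while large files almost always are, here is a toy model of chunked storage. It assumes a simple round-robin placement of fixed-size chunks over a handful of servers. The function names, the chunk size and the server count are all made up for the example, and the real BeeGFS placement and metadata layout on Baobab may well differ.

```python
def servers_for_file(file_size, chunk_size, num_servers, first_server=0):
    """Return the set of server indices holding at least one chunk.

    Toy round-robin model (an assumption, not the actual BeeGFS algorithm):
    chunk i is stored on server (first_server + i) % num_servers.
    """
    # Ceiling division; an empty file is still counted as one chunk here
    num_chunks = max(1, -(-file_size // chunk_size))
    return {(first_server + i) % num_servers for i in range(num_chunks)}


def is_affected(file_size, chunk_size, num_servers, failed_server, first_server=0):
    """True if any chunk of the file lives on the failed server."""
    placement = servers_for_file(file_size, chunk_size, num_servers, first_server)
    return failed_server in placement


if __name__ == "__main__":
    KIB = 1024
    MIB = 1024 * KIB
    # Hypothetical setup: 4 servers, 512 KiB chunks, server 0 has crashed
    small = is_affected(100 * KIB, 512 * KIB, 4, failed_server=0, first_server=2)
    big = is_affected(10 * MIB, 512 * KIB, 4, failed_server=0, first_server=2)
    print("small file affected:", small)
    print("big file affected:", big)
```

In this model the small file fits in a single chunk on server 2 and stays readable, while the 10 MiB file has chunks on every server and therefore fails as soon as the chunk on server 0 is needed.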