EOS setup
We have experienced a number of issues while accessing EOS via the Samba interface. The Event Viewer (eventvwr) on the machines shows entries such as:
Log Name: Microsoft-Windows-SmbClient/Connectivity
Source: Microsoft-Windows-SMBClient
Date: 10/5/2017 3:52:40 PM
Event ID: 30809
Task Category: None
Level: Error
Keywords: (64)
User: N/A
Computer: doconv01-test.cern.ch
Description:
A request timed out because there was no response from the server.
Server name: \cernbox-smb.cern.ch
Session ID:0xF13435BF
Tree ID:0xFDF9B7F1
Message ID:0x9F44C
Command: Create
Guidance:
The server is responding over TCP but not over SMB. Ensure the Server service is running and responsive, and the disks do not have high per-IO latency, which makes the disks appear unresponsive to SMB. Also, ensure the server is responsive overall and not paused; for instance, make sure you can log on to it.
--
Log Name: Microsoft-Windows-SmbClient/Connectivity
Source: Microsoft-Windows-SMBClient
Date: 10/5/2017 3:52:40 PM
Event ID: 30805
Task Category: None
Level: Warning
Keywords: (64)
User: N/A
Computer: doconv01-test.cern.ch
Description:
The client lost its session to the server.
Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.
Server name: \cernbox-smb.cern.ch
Session ID: 0xF13435BF
Guidance:
If the server is a Windows Failover Cluster file server, then this message occurs when the file share moves between cluster nodes. There should also be an anti-event 30806 indicating the session to the server was re-established. If the server is not a failover cluster, it is likely that the server was previously online, but it is now inaccessible over the network.
--
Log Name: Microsoft-Windows-SmbClient/Connectivity
Source: Microsoft-Windows-SMBClient
Date: 10/5/2017 3:52:40 PM
Event ID: 30807
Task Category: None
Level: Warning
Keywords: (64)
User: N/A
Computer: doconv01-test.cern.ch
Description:
The connection to the share was lost.
Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.
Share name: \cernbox-smb.cern.ch\eos
Session ID: 0xF13435BF
Tree ID: 0xFDF9B7F1
Guidance:
If the server is a Windows Failover Cluster file server, then this message occurs when the file share moves between cluster nodes. There should also be an anti-event 30808 indicating the session to the server was re-established. If the server is not a failover cluster, it is likely that the server was previously online, but it is now inaccessible over the network.
Because our service account is connected to the eosuser space, we should add a new sync folder. The final configuration can be seen in the next image:
On the General tab, please clear the option "Ask confirmation before downloading folders larger than 500 MB".
Clean-up of EOS
Sadly, the CERNBox client was not working properly (e.g. INC1802776). Contents removed on the local client were not removed on the far end, e.g. under /eos/project/d/doconverter. This generated continuous complaints, especially from Indico users.
To avoid these problems, two things have been done:
- The converter was moved to work directly on EOS via SMB, e.g.:
PS C:\Users\cdsconv> net use
New connections will be remembered.
Status       Local     Remote                    Network
-------------------------------------------------------------------------------
             G:        \\cern.ch\dfs             Microsoft Windows Network
OK           Y:        \\cbox-samba-02.cern.ch\eos
                                                Microsoft Windows Network
- A script has been written and placed at /afs/cern.ch/user/r/rXXXX/eos-doconverter-cleanup.sh. It is not ideal to keep the script under a user's AFS home directory, but the cdsconv inbox, if it exists, is clearly not monitored. The script is run via acrontab:
$ acrontab -l
27 18 * * * lxplus /afs/cern.ch/user/r/rXXXXX/eos-doconverter-cleanup.sh >> $HOME/`date +\%Y\%m\%d\%H\%M\%S`-eoscleanup.log 2>&1
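The backslashes before each % in the entry above are there because crontab-style files treat an unescaped % as the end of the command. As a minimal sketch, the same timestamped log name can be built in a plain shell without the backslashes. The function name eoscleanup_log_name is hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: builds the log file name that the acrontab entry
# redirects to. In a normal shell, % does not need escaping.
eoscleanup_log_name() {
    printf '%s/%s-eoscleanup.log\n' "$HOME" "$(date +%Y%m%d%H%M%S)"
}

# Print the name that a run started right now would use.
eoscleanup_log_name
```

This is handy when running the cleanup script by hand and wanting its log to sit next to the ones produced by acrontab.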
Contents of the script can be seen here:
#!/bin/bash
export EOS_MGM_URL="root://eosuser-internal.cern.ch"
ROOTPATH=/eos/project/d/doconverter
WHERETOLOOK=(doconverter01 doconverter02)
# Cleanup of empty, numerically named directories under uploadsresults
for i in "${WHERETOLOOK[@]}"; do
    echo "Working on $ROOTPATH/$i/var/uploadsresults"
    # List directories with their child counts, keep the empty ones, then print and run each removal
    eos find --childcount -d "$ROOTPATH/$i/var/uploadsresults/" | grep "ndir=0 nfiles=0" | awk '{print $1}' | grep -E "uploadsresults/[0-9]+/$" | xargs -I {} echo "eos rm -rf {}" | bash --noprofile --norc -x
done
# Cleanup of files selected by -ctime +2
for i in "${WHERETOLOOK[@]}"; do
    echo "Working on $ROOTPATH/$i/var"
    eos find -f -ctime +2 "$ROOTPATH/$i/var" | awk '{print $1}' | sed 's/path=//g' | xargs -I {} echo "eos rm {}" | bash --noprofile --norc -x
done
# Cleanup of this job's own log files older than 5 days (the acrontab entry writes them to $HOME)
echo "Removing log files"
find "$HOME" -maxdepth 1 -name '*-eoscleanup.log' -mtime +5 -exec ls -l {} \;
find "$HOME" -maxdepth 1 -name '*-eoscleanup.log' -mtime +5 -exec rm -f {} \;
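The directory-selection step of the script above can be tried in isolation, without touching EOS. The sketch below assumes that each line has the shape "path ndir=N nfiles=N" that the script greps for. The function names and the sample lines are invented for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: reads `eos find --childcount -d` style lines on stdin
# and prints the paths of empty, numerically named upload directories.
select_empty_upload_dirs() {
    grep "ndir=0 nfiles=0" \
        | awk '{print $1}' \
        | grep -E "uploadsresults/[0-9]+/$"
}

# Invented sample lines, so the filter can be checked without EOS access.
sample_childcount_output() {
    cat <<'EOF'
/eos/project/d/doconverter/doconverter01/var/uploadsresults/ ndir=2 nfiles=0
/eos/project/d/doconverter/doconverter01/var/uploadsresults/1234/ ndir=0 nfiles=0
/eos/project/d/doconverter/doconverter01/var/uploadsresults/5678/ ndir=0 nfiles=3
EOF
}

# Dry run: only the empty directory 1234/ should be selected.
sample_childcount_output | select_empty_upload_dirs
```

Running the filter on captured output first is a cheap way to confirm what would be deleted before the real pipeline hands the paths to eos rm.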
Extra monitoring
Due to increasing issues with Samba access to EOS, especially after OTG0055732, the doconverter machines sometimes get stuck on the Samba mount of EOS. Sadly, the monitoring commands on the server fail silently. To detect this situation I have installed an acrontab entry on lxplus under my account:
$ acrontab -l
*/30 * * * * lxplus export a=$(curl -s -X GET https://doconverter.web.cern.ch/doconverter/api/v1.0/stats/4 | jq '.result'); if [ $a -gt 50 ]; then mailx -r cron@acrontab.cern.ch -s "$a documents are queueing: conversion service" weblecture-service@cern.ch < /dev/null; fi > /dev/null 2>&1
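The one-liner above is hard to read and its numeric test errors out if the API returns something that is not a number. As a sketch only, the same check can be written as a small script. The URL, threshold and mail addresses are taken from the acrontab entry, while the function names queue_exceeds and check_queue are hypothetical.

```shell
#!/bin/sh
# Sketch of the queue check as a script; function names are hypothetical.
STATS_URL="https://doconverter.web.cern.ch/doconverter/api/v1.0/stats/4"
THRESHOLD=50

# Succeeds only if $1 is a non-negative integer strictly greater than $2.
queue_exceeds() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, "null" or any non-numeric reply
    esac
    [ "$1" -gt "$2" ]
}

# Fetches the queue length and mails an alert when it is above the threshold.
check_queue() {
    queued=$(curl -s -X GET "$STATS_URL" | jq '.result')
    if queue_exceeds "$queued" "$THRESHOLD"; then
        mailx -r cron@acrontab.cern.ch \
            -s "$queued documents are queueing: conversion service" \
            weblecture-service@cern.ch < /dev/null
    fi
}

# Offline demonstration of the comparison logic with made-up values.
queue_exceeds 75 "$THRESHOLD" && echo "75 would trigger an alert"
queue_exceeds null "$THRESHOLD" || echo "a non-numeric reply is ignored"

# check_queue   # uncomment to run the real check (needs curl, jq and mailx)
```

Keeping the comparison in its own function makes it possible to test the alert condition without network access or sending mail.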
If this happens, try to disconnect and reconnect the mount point. Please stop the converter with the -s flag and afterwards use the -r flag. Please check Start/stop.