EOS setup

We have experienced a number of issues while accessing EOS via the Samba interface. The Event Viewer (eventvwr) on the machines shows entries such as:

Log Name:      Microsoft-Windows-SmbClient/Connectivity
Source:        Microsoft-Windows-SMBClient
Date:          10/5/2017 3:52:40 PM
Event ID:      30809
Task Category: None
Level:         Error
Keywords:      (64)
User:          N/A
Computer:      doconv01-test.cern.ch
Description:
A request timed out because there was no response from the server.

Server name: \\cernbox-smb.cern.ch
Session ID:0xF13435BF
Tree ID:0xFDF9B7F1
Message ID:0x9F44C
Command: Create

Guidance:
The server is responding over TCP but not over SMB. Ensure the Server service is running and responsive, and the disks do not have high per-IO latency, which makes the disks appear unresponsive to SMB. Also, ensure the server is responsive overall and not paused; for instance, make sure you can log on to it.
--

Log Name:      Microsoft-Windows-SmbClient/Connectivity
Source:        Microsoft-Windows-SMBClient
Date:          10/5/2017 3:52:40 PM
Event ID:      30805
Task Category: None
Level:         Warning
Keywords:      (64)
User:          N/A
Computer:      doconv01-test.cern.ch
Description:
The client lost its session to the server.

Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.

Server name: \\cernbox-smb.cern.ch
Session ID: 0xF13435BF

Guidance:
If the server is a Windows Failover Cluster file server, then this message occurs when the file share moves between cluster nodes. There should also be an anti-event 30806 indicating the session to the server was re-established. If the server is not a failover cluster, it is likely that the server was previously online, but it is now inaccessible over the network.

--
Log Name:      Microsoft-Windows-SmbClient/Connectivity
Source:        Microsoft-Windows-SMBClient
Date:          10/5/2017 3:52:40 PM
Event ID:      30807
Task Category: None
Level:         Warning
Keywords:      (64)
User:          N/A
Computer:      doconv01-test.cern.ch
Description:
The connection to the share was lost.

Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.

Share name: \\cernbox-smb.cern.ch\eos
Session ID: 0xF13435BF
Tree ID: 0xFDF9B7F1

Guidance:
If the server is a Windows Failover Cluster file server, then this message occurs when the file share moves between cluster nodes. There should also be an anti-event 30808 indicating the session to the server was re-established. If the server is not a failover cluster, it is likely that the server was previously online, but it is now inaccessible over the network.

These timeouts cause instabilities in the application, which is why we now run with the EOS sync client. While installing it, take care not to select "Integration for Windows Explorer" (see image).

As our service account is connected to the eosuser space, we need to add a new sync folder. The final configuration can be seen in the next image:

On the General tab, please clear "Ask confirmation before downloading folders larger than 500 MB".

Clean-up of EOS

Sadly, the CERNBox client was not working properly (e.g. INC1802776). Contents on the far end, e.g. /eos/project/d/doconverter, were not removed when they were removed on the local client. This generated continuous complaints, especially from Indico users.

To avoid these problems, two things have been done. First, the converter was moved to work directly on EOS via SMB, e.g.:

   PS C:\Users\cdsconv> net use
New connections will be remembered.


Status       Local     Remote                    Network

-------------------------------------------------------------------------------
             G:        \\cern.ch\dfs             Microsoft Windows Network
OK           Y:        \\cbox-samba-02.cern.ch\eos
                                                Microsoft Windows Network

Second, a script has been written and placed at /afs/cern.ch/user/r/rXXXX/eos-doconverter-cleanup.sh. It is not optimal to keep the script under a user's AFS home directory, but the cdsconv inbox, if it exists, is clearly not monitored. The script is run via acrontab:

$ acrontab -l
27 18 * * * lxplus /afs/cern.ch/user/r/rXXXXX/eos-doconverter-cleanup.sh >> $HOME/`date +\%Y\%m\%d\%H\%M\%S`-eoscleanup.log 2>&1
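
The log file name in the entry above is built from a timestamp. In crontab-style files an unescaped % is treated as a line break, which is why the entry writes \% in the date format. As a minimal sketch, the same name can be produced inside a script without the escaping; the function name cleanup_log_name is only illustrative and is not part of the existing setup:

```shell
#!/bin/bash

# Hypothetical helper: print a timestamped log path such as
# <dir>/YYYYmmddHHMMSS-eoscleanup.log, matching the name used by the
# acrontab entry. The directory defaults to $HOME, as in that entry.
cleanup_log_name() {
    local dir=${1:-$HOME}
    local stamp
    # Inside a script no backslash is needed before %.
    stamp=$(date +%Y%m%d%H%M%S)
    printf '%s/%s-eoscleanup.log\n' "$dir" "$stamp"
}
```

Calling cleanup_log_name with no argument prints a path under $HOME with the current timestamp.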

Contents of the script can be seen here:

#!/bin/bash
# Removes stale doconverter data from EOS and deletes old logs of this job.

export EOS_MGM_URL="root://eosuser-internal.cern.ch"
ROOTPATH=/eos/project/d/doconverter
WHERETOLOOK=(doconverter01 doconverter02)

# Cleanup of directories: only empty, numerically named upload directories
for i in "${WHERETOLOOK[@]}";
do
        echo "Working on $ROOTPATH/$i/var/uploadsresults"
        # The generated "eos rm" commands are piped to bash -x so that each one is traced in the log
        eos find --childcount -d "$ROOTPATH/$i/var/uploadsresults/" | grep "ndir=0 nfiles=0" | awk '{print $1}' | grep -E "uploadsresults/[0-9]+/$" | xargs -I{} echo "eos rm -rf {}" | bash --noprofile --norc -x
done

# Cleanup of files: everything selected by -ctime +2
for i in "${WHERETOLOOK[@]}";
do
        echo "Working on $ROOTPATH/$i/var"
        eos find -f -ctime +2 "$ROOTPATH/$i/var" | awk '{print $1}' | sed 's/path=//g' | xargs -I{} echo "eos rm {}" | bash --noprofile --norc -x
done

# Cleanup of log files older than 5 days (the acrontab entry writes them to $HOME)
echo Removing log files
find "$HOME" -maxdepth 1 -name '*-eoscleanup.log' -mtime +5 -exec ls -l {} \;
find "$HOME" -maxdepth 1 -name '*-eoscleanup.log' -mtime +5 -exec rm -f {} \;

This is a "temporary" measure till things get back to normal in EOS.
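
Before changing the cleanup script it can help to see which directories the first loop would select without removing anything. The sketch below reproduces only the selection step (the grep/awk/egrep chain) as a single awk filter. It assumes that each line printed by eos find --childcount -d has the shape "<path>/ ndir=N nfiles=M", which is inferred from the script itself and not checked against the EOS documentation; the function name select_empty_upload_dirs is hypothetical.

```shell
#!/bin/bash

# Hypothetical dry-run filter: reads "eos find --childcount -d" style lines
# on stdin and prints the paths of empty, numerically named directories
# directly under uploadsresults/. Nothing is deleted.
# Assumed input shape (inferred from the cleanup script): "<path>/ ndir=N nfiles=M"
select_empty_upload_dirs() {
    awk '$2 == "ndir=0" && $3 == "nfiles=0" && $1 ~ /uploadsresults\/[0-9]+\/$/ { print $1 }'
}
```

For example, eos find --childcount -d /eos/project/d/doconverter/doconverter01/var/uploadsresults/ | select_empty_upload_dirs would list the candidates for removal on doconverter01.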

Extra monitoring

Due to increasing issues with Samba access to EOS, especially after OTG0055732, the doconverter machines sometimes get stuck on the Samba mount of EOS. Sadly, the monitoring commands on the server fail silently. To detect this situation I have installed an acrontab command on lxplus under my account:

$ acrontab -l
*/30 * * * * lxplus export a=$(curl -s -X GET  https://doconverter.web.cern.ch/doconverter/api/v1.0/stats/4 | jq '.result'); if [ $a -gt 50 ]; then mailx -r cron@acrontab.cern.ch -s "$a documents are queueing: conversion service" weblecture-service@cern.ch < /dev/null; fi > /dev/null 2>&1

If this happens, try to disconnect and reconnect the mount point. Please stop the converter with the -s flag first and use the -r flag afterwards. See Start/stop for details.
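
The acrontab one-liner is hard to read and compares an unquoted, possibly empty value with 50. As a sketch only, the same check could live in a small script that treats a missing or non-numeric answer as "no alert". The function names queue_too_long and check_queue are hypothetical, and the sketch assumes, like the one-liner, that .result holds the number of queued documents.

```shell
#!/bin/bash

STATS_URL="https://doconverter.web.cern.ch/doconverter/api/v1.0/stats/4"
THRESHOLD=50

# Succeeds only when $1 is a non-negative integer strictly greater than $2.
# An empty or non-numeric value (for example when curl or jq failed)
# is never treated as an alert.
queue_too_long() {
    local count=$1 threshold=$2
    # 10# forces base 10 so that values with leading zeros are not read as octal
    [[ $count =~ ^[0-9]+$ ]] && (( 10#$count > threshold ))
}

# Fetch the queue length and mail a warning when it exceeds the threshold.
# Sender, subject and recipient are copied from the acrontab entry.
check_queue() {
    local count
    count=$(curl -s "$STATS_URL" | jq -r '.result')
    if queue_too_long "$count" "$THRESHOLD"; then
        mailx -r cron@acrontab.cern.ch \
              -s "$count documents are queueing: conversion service" \
              weblecture-service@cern.ch < /dev/null
    fi
}
```

To use it, add a final line calling check_queue and point the acrontab entry at the script. A failed request still produces no mail, so this variant does not detect an unreachable stats endpoint.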

Last update: October 4, 2021