These are my brain dumps from problems I worked at solving this week via trail and error.
This week I had to refactor a service to be able to write a pandas DataFrame to an MS SQL database. Pretty straight forward right? Well not quite. It's pretty easy to write a pandas DataFrame to a database. But when you want to do it with good performance a whole bunch of things come in to view. Throw in Windows Active Directory based authentication and tell me how things go for you.
So the problem starts to unfold when you attempt to use a trusted connection. When you throw integrated authentication with Windows Active Directory into the mix. First you understand what ODBC drivers are with pyodbc. There is a FreeTDS driver that does the job pretty well and handles AD authentication. But even if you're writing a non-meaningful amount of data using it, the performance is terrible.
Also, if you're using the convenience of pandas to_sql method, you'll discover that you need to pass a SQLAlchemy engine. Well you could also pass a connection, which I found out now.
To mitigate the performance issue, you can use execute_many or fast_execute. But guess what, FreeTDS driver doesn't support those.
OK, I guess I have to use the Microsoft ODBC driver for SQL Server. Here's the catch, you're running you're service in in Linux Docker container. So you gotta figure out how to install and setup ODBC driver on Linux. Fair enough. The installation instructions are pretty straight forward and it's not a biggie. Unless, your SQL Server account uses Windows AD authentication. In other words, you need to use the Trusted Connection parameter. But how do I use that with my Windows AD account credentials?
Enter Kerberos. Kerberos is the authentication mechanism you can use to talk to a Windows AD server and get a ticket. This ticket is picked up by the Microsoft ODBC SQL driver.
You need to setup the kerberos client on Linux, configure the hostnames of key machines you need to talk to, setup the configuration and initiate the connection. If you want to use this in a service, learn about keytabs. Also, you might encounter issues using the OBDC driver if it doesn't play nice with you default OpenSSL configuration. True story.
But.. Kerberos tickets expire and either need renewal for the service to function. Now you also need to handle these error condition on your service program logic as well. You need to setup a mechanism to start the ticket request when the container starts. You need to confirm that the ticket is valid before using the database connection and renew the ticket if not.
Imagine you figured all this out. Now it's time to package this as an API. FastAPI. But FastAPI is asynchronous. You need to make sure your service doesn't create a shit load of connections and overwhelm the database and create locks that aren't released that need to go through the DBA to clear up who's in a very different time zone (again, True Story). So you use some fancy dependency injection to make sure you create the SQLAlchemy engine once which is used for handling all requests. Don't forget that you need to make sure the Kerberos ticket is valid before you talk to the database though.
So with all this, you also schedule a cron job that will periodically renew the ticket.
That will solve it! But here's some other things to consider. How about your service needs to connect to many database servers that each use a different credentials to connect (because security). But because you're managing kerberos tickets on the OS level not within your program, you have to figure something else out.
Other rabbit holes explored, reading MS ODBC driver connection string keywords and attributes to see if it support Windows AD authentication directly from the connection string instead of relying on kerberos (it doesn't yet, but for Azure AD it does!). Also, understanding how SQLAlchemy works. What is the difference between Engine, Connection and Session?
You get the idea. Well I learned a ton this week. Did you?