The idea of writing a database query that can then go out to an external source may not occur to someone who is not a DBA early on. That is: instead of figuring out how to grab then load multiple data sets into the same store, or configuring your application backend to connect to a bunch of disparate sources, why not use query JOINs like you usually would across tables within one database?
In case you're not familiar, the dblink module in PostgreSQL, along with the concept of database links or linked servers in other DBMSs, has been around for a while. Foreign data wrappers are newer, having been introduced with PG 9+. Postgres now has a lot of foreign data wrappers available and they work with plenty of different source types: NoSQL databases, platforms like Twitter and Facebook, geospatial data formats, etc. My colleague Craig Kerstiens has shared his thoughts on Postgres being a "batteries included" database, and it's so easy to see why.
That said, why might you still need foreign data wrappers between Postgres servers? The default implementation doesn't support cross-database queries, even on the same Postgres server. So you still need the wrapper to handle the connection and fetch foreign data. postgres_fdw is more or less the dblink equivalent for access between Postgres servers, with the main difference being that postgres_fdw conforms to SQL standards. They do provide a lot of the same functionality but postgres_fdw is more recommended and more widely used at this point.
If you haven't tried postgres_fdw, setting it up is pretty simple. Say I have a contacts table on a local Postgres install, and I want to be able to query a remote Crunchy Bridge database that stores sales information. I can do the following:
- Load the extension in my local database (`postgres_fdw` is included in Postgres `contrib`, and you do need CREATE privileges on the local database):
CREATE EXTENSION postgres_fdw;
- Create a foreign server:
CREATE SERVER salesinfo_bridge FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'p.2gdmzr2pcbadzcstrkuolxvtpq.db.postgresbridge.com', dbname 'sales');
- Set up a user mapping to authenticate:
CREATE USER MAPPING FOR postgres SERVER salesinfo_bridge OPTIONS (user 'fdw_user', password 'password');
- Then I can set up foreign tables that correspond to the tables I want to query on the foreign server. This is done in two different ways:
- Run CREATE FOREIGN TABLE, which is pretty similar to CREATE TABLE in that you have to define column names, data types, constraints etc.
- Run IMPORT FOREIGN SCHEMA, which imports tables and views from a schema, and creates foreign tables that match the definitions for the external tables. You even have the option to include/exclude specific tables only, which makes it even more convenient:
test=# IMPORT FOREIGN SCHEMA public LIMIT TO (payment_methods, accounts) FROM SERVER salesinfo_bridge INTO public;
And I can carry on querying as if these tables were all on the same database!
test=# SELECT c.id, pm.type, acct.balance FROM contacts c LEFT JOIN accounts acct ON c.id = acct.contact_id LEFT JOIN payment_methods pm ON acct.id = pm.acct_id WHERE acct.balance > 0; id | type | balance ----+------------+---------- 1 | mastercard | 2742.62 2 | mastercard | 464.76 3 | mastercard | 116.67
Even though I've recreated up the foreign schema on the local server, these tables don't actually store data locally. This means that the local server doesn't know anything about statistics from the external server (we'll look at the implications a bit later).
Don't forget about the psql commands that provide information related to foreign data wrappers:
\des- list foreign servers
\deu- list uses mappings
\det- list foreign tables
\dtE- list both local and foreign tables
\d <name of foreign table>- show columns, data types, and other table metadata
Some things to consider for query optimization
postgres_fdw will try to optimize queries on its own. With that said, the local server doesn't automatically gather statistics from the foreign server or tables either. That makes sense, but can also make query planning and optimization still somewhat involved. As a user, you might consider these approaches (the first two are described in the official postgres_fdw docs):
- Tell the foreign data wrapper that you want the foreign server to perform the cost estimate, by setting the use_remote_estimate option to true on the server or table level.
So, every time you query the foreign server, you're asking it to perform additional EXPLAIN commands. You're effectively requesting more across the network each time, which may add to a longer total time for your query to return results depending on how complex the query is. If you leave it to the default false value, the local server performs the cost estimation itself.
- Run ANALYZE on the foreign tables, which updates those table statistics on the local server. But if the foreign tables are updated pretty frequently, the local statistics can quickly become stale as well. So you'd also need to consider how often you may have to schedule ANALYZE.
- Use a materialized view on the foreign table, which you can also refresh on a desired basis. You'd be working with a local snapshot of the external data, which should help with speed. This could work well when you're not required to always use live, up-to-the-minute data.
Foreign data wrappers as an alternative to ETL?
From my non-DBA perspective, the main takeaway is that foreign data wrappers can simplify data querying and analysis when you need data from disparate sources. And the biggest drawback may be that in many cases you can't "set it and forget it," if you don't want to risk poor query performance. With that said, if you're dealing with massive amounts of data anyway, then there might be more suitable approaches. But either way, it's nice that there are other options aside from the standard ETL/ELT pattern.
For those of you out there using foreign data wrappers, what have been your most important considerations? I'm also curious to hear of other use cases for FDWs. We're all ears at @crunchydata.